simultaneously map and sequence the human genome by means of end sequences from 150-kbp bacterial artificial chromosomes (BACs) (17, 18). The end sequences spanned by known distances provide long-range continuity across the genome. A modification of the BAC end-sequencing (BES) method was applied successfully to complete chromosome 2 from the Arabidopsis thaliana genome (19).

In 1997, Weber and Myers (20) proposed whole-genome shotgun sequencing of the human genome. Their proposal was not well received (21). However, by early 1998, as less than 5% of the genome had been sequenced, it was clear that the rate of progress in human genome sequencing worldwide was very slow (22), and the prospects for finishing the genome by the 2005 goal were uncertain.

In early 1998, PE Biosystems (now Applied Biosystems) developed an automated, high-throughput capillary DNA sequencer, subsequently called the ABI PRISM 3700 DNA Analyzer. Discussions between PE Biosystems and TIGR scientists resulted in a plan to undertake the sequencing of the human genome with the 3700 DNA Analyzer and the whole-genome shotgun sequencing techniques developed at TIGR (23). Many of the principles of operation of a genome-sequencing facility were established in the TIGR facility (24). However, the facility envisioned for Celera would have a capacity roughly 50 times that of TIGR, and thus new developments were required for sample preparation and tracking and for whole-genome assembly. Some argued that the required 150-fold scale-up from the H. influenzae genome to the human genome with its complex repeat sequences was not feasible (25). The Drosophila melanogaster genome was thus chosen as a test case for whole-genome assembly on a large and complex eukaryotic genome. In collaboration with Gerald Rubin and the Berkeley Drosophila Genome Project, the nucleotide sequence of the 120-Mbp euchromatic portion of the Drosophila genome was determined over a 1-year period (26-28). The Drosophila genome-sequencing effort resulted in two key findings: (i) that the assembly algorithms could generate chromosome assemblies with highly accurate order and orientation with substantially less than 10-fold coverage, and (ii) that undertaking multiple interim assemblies in place of one comprehensive final assembly was not of value.

These findings, together with the dramatic changes in the public genome effort subsequent to the formation of Celera (29), led to a modified whole-genome shotgun sequencing approach to the human genome. We initially proposed to do 10-fold sequence coverage of the genome over a 3-year period and to make interim assembled sequence data available quarterly. The modifications included a plan to perform random shotgun sequencing to ~5-fold coverage and to use the unordered and unoriented BAC sequence segments and subassemblies published in GenBank by the publicly funded genome effort (30) to accelerate the project. We also abandoned the quarterly announcements in the absence of interim assemblies to report.

Although this strategy provided a reasonable result very early that was consistent with a whole-genome shotgun assembly with eight-fold coverage, the human genome sequence is not as finished as the Drosophila genome was with an effective 13-fold coverage. However, it became clear that even with this reduced coverage strategy, Celera could generate an accurately ordered and oriented scaffold sequence of the human genome in less than 1 year. Human genome sequencing was initiated 8 September 1999 and completed 17 June 2000. The first assembly was completed 25 June 2000, and the assembly reported here was completed 1 October 2000. Here we describe the whole-genome random shotgun sequencing effort applied to the human genome. We developed two different assembly approaches for assembling the ~3 billion bp that make up the 23 pairs of chromosomes of the Homo sapiens genome. Any GenBank-derived data were shredded to remove potential bias to the final sequence from chimeric clones, foreign DNA contamination, or misassembled contigs. Insofar as a correctly and accurately assembled genome sequence with faithful order and orientation of contigs is essential for an accurate analysis of the human genetic code, we have devoted a considerable portion of this manuscript to the documentation of the quality of our reconstruction of the genome. We also describe our preliminary analysis of the human genetic code on the basis of computational methods. Figure 1 (see fold-out chart associated with this issue; files for each chromosome can be found in Web fig. 1 on Science Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1) provides a graphical overview of the genome and the features encoded in it. The detailed manual curation and interpretation of the genome are just beginning.

To aid the reader in locating specific analytical sections, we have divided the paper into seven broad sections. A summary of the major results appears at the beginning of each section.

1 Sources of DNA and Sequencing Methods
2 Genome Assembly Strategy and Characterization
3 Gene Prediction and Annotation
4 Genome Structure
5 Genome Evolution
6 A Genome-Wide Examination of Sequence Variations
7 An Overview of the Predicted Protein-Coding Genes in the Human Genome
8 Conclusions



1 Sources of DNA and Sequencing Methods

Summary. This section discusses the rationale and ethical rules governing donor selection to ensure ethnic and gender diversity along with the methodologies for DNA extraction and library construction. The plasmid library construction is the first critical step in shotgun sequencing. If the DNA libraries are not uniform in size, nonchimeric, and do not randomly represent the genome, then the subsequent steps cannot accurately reconstruct the genome sequence. We used automated high-throughput DNA sequencing and the computational infrastructure to enable efficient tracking of enormous amounts of sequence information (27.3 million sequence reads; 14.9 billion bp of sequence). Sequencing and tracking from both ends of plasmid clones from 2-, 10-, and 50-kbp libraries were essential to the computational reconstruction of the genome. Our evidence indicates that the accurate pairing rate of end sequences was greater than 98%.
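As a rough arithmetic check on the totals quoted in this summary, the read and base-pair counts imply a mean read length of about 546 bp and roughly five-fold sequence coverage. The sketch below is illustrative only: the function names are hypothetical, and the 3-billion-bp genome size is a round-number assumption made for the example, not a measured value.

```python
# Totals quoted in the summary above.
TOTAL_READS = 27.3e6      # sequence reads
TOTAL_BP = 14.9e9         # base pairs of sequence

# Assumption for illustration: a round genome size of 3 billion bp.
ASSUMED_GENOME_BP = 3.0e9


def mean_read_length(total_bp: float, total_reads: float) -> float:
    """Average number of base pairs per read."""
    return total_bp / total_reads


def fold_coverage(total_bp: float, genome_bp: float) -> float:
    """Average number of times each genome position is sequenced."""
    return total_bp / genome_bp


if __name__ == "__main__":
    print(f"mean read length: {mean_read_length(TOTAL_BP, TOTAL_READS):.0f} bp")
    print(f"fold coverage:    {fold_coverage(TOTAL_BP, ASSUMED_GENOME_BP):.2f}X")
```

Under these assumptions the coverage comes out just under 5X, in line with the ~5-fold random shotgun plan described earlier.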

Various policies of the United States and the World Medical Association, specifically the Declaration of Helsinki, offer recommendations for conducting experiments with human subjects. We convened an Institutional Review Board (IRB) (31) that helped us establish the protocol for obtaining and using human DNA and the informed consent process used to enroll research volunteers for the DNA-sequencing studies reported here. We adopted several steps and procedures to protect the privacy rights and confidentiality of the research subjects (donors). These included a two-stage consent process, a secure random alphanumeric coding system for specimens and records, circumscribed contact with the subjects by researchers, and options for off-site contact of donors. In addition, Celera applied for and received a Certificate of Confidentiality from the Department of Health and Human Services. This Certificate authorized Celera to protect the privacy of the individuals who volunteered to be donors as provided in Section 301(d) of the Public Health Service Act 42 U.S.C. 241(d).

Celera and the IRB believed that the initial version of a completed human genome should be a composite derived from multiple donors of diverse ethnic backgrounds. Prospective donors were asked, on a voluntary basis, to self-designate an ethnogeographic category (e.g., African-American, Chinese, Hispanic, Caucasian, etc.). We enrolled 21 donors (32).

Three basic items of information from each donor were recorded and linked by confidential code to the donated sample: age, sex, and self-designated ethnogeographic group. From females, ~130 ml of whole, heparinized blood was collected. From males, ~130 ml of whole, heparinized blood was
genome, and even a modest error rate can reduce the effectiveness of assembly. In addition, maintaining the validity of mate-pair information is absolutely critical for the algorithms described below. Procedural controls were established for maintaining the validity of sequence mate-pairs as sequencing reactions proceeded through the process, including strict rules built into the LIMS. The accuracy of sequence data produced by the Celera process was validated in the course of the Drosophila genome project (25). By collecting data for the entire human genome in a single facility, we were able to ensure uniform quality standards and the cost advantages associated with automation, an economy of scale, and process consistency.

2 Genome Assembly Strategy and Characterization

Summary. We describe in this section the two approaches that we used to assemble the genome. One method involves the computational combination of all sequence reads with shredded data from GenBank to generate an independent, nonbiased view of the genome. The second approach involves clustering all of the fragments to a region or chromosome on the basis of mapping information. The clustered data were then shredded and subjected to computational assembly. Both approaches provided essentially the same reconstruction of assembled DNA sequence with proper order and orientation. The second method provided slightly greater sequence coverage (fewer gaps) and was the principal sequence used for the analysis phase. In addition, we document the completeness and correctness of this assembly process
Fig. 2. Flow diagram for sequencing pipeline. Samples are received, selected, and processed in compliance with standard operating procedures, with a focus on quality within and across departments. Each process has defined inputs and outputs with the capability to exchange samples and data with both internal and external entities according to defined quality guidelines. Manufacturing pipeline processes, products, quality control measures, and responsible parties are indicated and are described further in the text.
[...] information was ignored because some BACs were not correctly placed on the PFP physical map and because we had strong evidence that at least 22% of the [...] data that were not part of the given BAC (41), possibly as a result of sample-tracking errors.

Table 2. GenBank data input into assembly. Columns: Center; Statistics. Centers listed include the Whitehead Institute/MIT Center for Genome Research; Washington University, USA; and Baylor College of Medicine, USA. Statistics reported for each center: number of accession records, number of contigs, total base pairs, total vector masked (bp), total contaminant masked (bp), and average contig length (bp).

[...] Southwestern Medical Center, University of Washington. †The 4,405,700,825 bases con[...] shredded into faux reads resulting in 2.96X coverage of the genome.

(see below). In short, we performed a true, ab initio whole-genome assembly in which we took the expedient of deriving additional sequence coverage, but not mate pairs, assembled bactigs, or genome locality, from some externally generated data.

In the compartmentalized shotgun assembly (CSA), Celera and PFP data were partitioned into the largest possible chromosomal segments or "components" that could be determined with confidence, and then shotgun assembly was applied to each partitioned subset wherein the bactig data were again shredded into faux reads to ensure an independent ab initio assembly of the component. By subsetting the data in this way, the overall computational effort was reduced and the effect of interchromosomal duplications was ameliorated. This also resulted in a reconstruction of the genome that was relatively independent of the whole-genome assembly results so that the two assemblies could be compared for consistency. The quality of the partitioning into components was crucial so that different genome regions were not mixed together. We constructed components from (i) the longest scaffolds of the sequence from each BAC and (ii) assembled scaffolds of data unique to Celera's data set. The BAC assemblies were obtained by a combining assembler that used the bactigs and the 5X Celera data mapped to those bactigs as input. This effort was undertaken as an interim step solely because the more accurate and complete the scaffold for a given sequence stretch, the more accurately one can tile these scaffolds into contiguous components on the basis of sequence overlap and mate-pair information. We further visually inspected and curated the scaffold tiling of the components to further increase its accuracy. For the final CSA assembly, all but the partitioning was ignored, and an independent, ab initio reconstruction of the sequence in each component was obtained by applying our whole-genome assembly algorithm to the partitioned, relevant Celera data and the shredded, faux reads of the partitioned, relevant bactig data.

2.3 Whole-genome assembly

The algorithms used for whole-genome assembly (WGA) of the human genome were enhancements to those used to produce the sequence of the Drosophila genome reported in detail in (28).

The WGA assembler consists of a pipeline composed of five principal stages: Screener, Overlapper, Unitigger, Scaffolder, and Repeat Resolver, respectively. The Screener finds and marks all microsatellite repeats with less than a 6-bp element, and screens out all known interspersed repeat elements, including Alu, Line, and ribosomal DNA. Marked regions get searched for overlaps, whereas screened regions do not get searched, but can be part of an overlap that involves unscreened matching segments.
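The marking of microsatellites with a repeat unit shorter than 6 bp can be illustrated with a small pattern-matching sketch. This is a stand-in for the idea only and is not the Screener's actual method: the function name `mark_microsatellites` and the minimum copy number `MIN_COPIES` are assumptions made for the example.

```python
import re

# Assumption: require at least this many tandem copies before marking a region.
MIN_COPIES = 4

# A 1- to 5-bp unit (lazy, so the shortest unit wins) followed by at least
# MIN_COPIES - 1 further exact copies of itself.
_TANDEM = re.compile(r"([ACGT]{1,5}?)\1{%d,}" % (MIN_COPIES - 1))


def mark_microsatellites(seq):
    """Return (start, end, unit) for each tandem repeat found; end is exclusive."""
    return [(m.start(), m.end(), m.group(1)) for m in _TANDEM.finditer(seq.upper())]


if __name__ == "__main__":
    read = "GGT" + "CA" * 6 + "TTGACTG"
    for start, end, unit in mark_microsatellites(read):
        print(f"marked {start}-{end}: ({unit})x{(end - start) // len(unit)}")
```

Only exact tandem copies are detected here; a practical screener would also need to tolerate mismatches within the repeat.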
chromosome 22, all stones were placed correctly.

The final method of resolving gaps is to fill them with assembled BAC data that cover the gap. We call this external gap "walking." We did not include the very aggressive "Pebbles" substage described in our Drosophila work, which made enough mistakes so as to produce repeat reconstructions for long interspersed elements whose quality was only 99.62% correct. We decided that for the human genome it was philosophically better not to introduce a step that was certain to produce less than 99.99% accuracy. The cost was a somewhat larger number of gaps of somewhat larger size.

At the final stage of the assembly process, and also at several intermediate points, a consensus sequence of every contig is produced. Our algorithm is driven by the principle of maximum parsimony, with quality-value-weighted measures for evaluating each base. The net effect is a Bayesian estimate of the correct base to report at each position. Consensus generation uses Celera data whenever it is present. In the event that no Celera data cover a given region, the BAC data sequence is used.
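One simple way to turn per-base quality values into a Bayesian base call is sketched below. It is not the assembler's consensus algorithm. It assumes Phred-scaled quality values, independent read errors, a uniform prior over the four bases, and errors spread evenly over the three alternative bases; the function name `call_consensus` is hypothetical.

```python
import math
from typing import List, Tuple

BASES = "ACGT"


def call_consensus(column: List[Tuple[str, int]]) -> Tuple[str, float]:
    """Pick the most probable base for one alignment column.

    `column` holds (observed_base, phred_quality) pairs, one per read.
    Returns (best_base, posterior_probability) under a uniform prior.
    """
    log_like = {t: 0.0 for t in BASES}
    for base, q in column:
        # Phred scale: error probability = 10^(-Q/10). Capped at 0.75 so a
        # quality-0 observation is uninformative rather than a log(0) error.
        err = min(10 ** (-q / 10), 0.75)
        for t in BASES:
            log_like[t] += math.log(1 - err) if base == t else math.log(err / 3)
    # Normalise in a numerically safe way.
    peak = max(log_like.values())
    weights = {t: math.exp(v - peak) for t, v in log_like.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total


if __name__ == "__main__":
    print(call_consensus([("A", 30), ("A", 20), ("C", 10)]))
    # One high-quality read can outweigh two low-quality reads.
    print(call_consensus([("G", 40), ("T", 5), ("T", 5)]))
```

Weighting by quality rather than counting votes is what lets a single confident observation override several weak ones, as in the second call.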

A key element of achieving a WGA of the human genome was to parallelize the Overlapper and the central consensus sequence-constructing subroutines. In addition, memory was a real issue: a straightforward application of the software we had built for Drosophila would have required a computer with 600 gigabytes of RAM. By making the Overlapper and Unitigger incremental, we were able to achieve the same computation with a maximum instantaneous usage of 28 gigabytes of RAM. Moreover, the incremental nature of the first three stages allowed us to continually update the state of this part of the computation as data were delivered and then perform a 7-day run to complete Scaffolding and Repeat Resolution whenever desired. For our assembly operations, the total compute infrastructure consists of 10 four-processor SMPs with 4 gigabytes of memory per cluster (Compaq's ES40, Regatta) and a 16-processor NUMA machine with 64 gigabytes of memory (Compaq's GS160, Wildfire). The total compute for a run of the assembler was roughly 20,000 CPU hours.
The assembly of Celera's data, together with the shredded bactig data, produced a set of scaffolds totaling 2.848 Gbp in span and consisting of 2.586 Gbp of sequence. The chaff, or set of reads not incorporated in the assembly, numbered 11.27 million (26%), which is consistent with our experience for Drosophila. More than 84% of the genome was covered by scaffolds >100 kbp long, and these averaged 91% sequence and 9% gaps with a total of 2.297 Gbp of sequence. There were a total of 93,857 gaps among the 1637 scaffolds >100 kbp. The average scaffold size was 1.5 Mbp, the average contig size was 24.06 kbp, and the average gap size was 2.43 kbp, where the distribution of each was essentially exponential. More than 50% of all gaps were less than 500 bp long, >62% of all gaps were less than 1 kbp long, and no gap was >100 kbp long. Similarly, more than 65% of the sequence is in contigs >30 kbp, more than 31% is in contigs >100 kbp, and the largest contig was 1.22 Mbp long. Table 3 gives detailed summary statistics for the structure of this assembly with a direct comparison to the compartmentalized shotgun assembly.
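The figures quoted above for scaffolds >100 kbp are mutually consistent, as a short calculation shows. The sketch uses only numbers stated in the paragraph, and the variable names are hypothetical. It relies on one structural fact: a scaffold containing c contigs has c - 1 internal gaps, so the number of contigs equals the number of gaps plus the number of scaffolds.

```python
# Figures quoted in the text for WGA scaffolds longer than 100 kbp.
SEQUENCE_BP = 2.297e9     # bases of actual sequence
N_GAPS = 93_857           # intrascaffold gaps
N_SCAFFOLDS = 1_637
AVG_GAP_BP = 2_430        # average gap size (2.43 kbp)

gap_bp = N_GAPS * AVG_GAP_BP            # estimated bases lying in gaps
span_bp = SEQUENCE_BP + gap_bp          # scaffold span = sequence + gaps
n_contigs = N_GAPS + N_SCAFFOLDS        # each scaffold has (contigs - 1) gaps

sequence_fraction = SEQUENCE_BP / span_bp
avg_scaffold_bp = span_bp / N_SCAFFOLDS
avg_contig_bp = SEQUENCE_BP / n_contigs

if __name__ == "__main__":
    print(f"sequence fraction: {sequence_fraction:.1%}")
    print(f"average scaffold:  {avg_scaffold_bp / 1e6:.2f} Mbp")
    print(f"contigs:           {n_contigs:,}")
    print(f"average contig:    {avg_contig_bp / 1e3:.2f} kbp")
```

The derived values agree with the stated 91% sequence, roughly 1.5-Mbp average scaffold, and roughly 24-kbp average contig, with small differences due to rounding in the quoted inputs.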

2.4 Compartmentalized shotgun assembly

In addition to the WGA approach, we pursued a localized assembly approach that was intended to subdivide the genome into segments, each of which could be shotgun assembled individually. We expected that this would help in resolution of large interchromosomal duplications and improve the statistics for calculating U-unitigs. The compartmentalized assembly process involved clustering Celera reads and bactigs into large, multiple megabase regions of the genome, and then running the WGA assembler on the Celera data and shredded, faux reads obtained from the bactig data.

The first phase of the CSA strategy was to separate Celera reads into those that matched the BAC contigs for a particular PFP BAC entry, and those that did not match any public data. Such matches must be guaranteed to


Table 3. Scaffold statistics for whole-genome and compartmentalized shotgun assemblies.

                                                       Scaffold size
Statistic                                              All            >30 kbp        >100 kbp       >500 kbp       >1000 kbp

Compartmentalized shotgun assembly
No. of bp in scaffolds (including intrascaffold gaps)  2,905,568,203  2,748,892,430  2,700,489,906  2,489,357,260  2,248,689,128
No. of bp in contigs                                   2,653,979,733  2,524,251,302  2,491,538,372  2,320,648,201  2,106,521,902
No. of scaffolds                                       53,591         2,845          1,935          1,060          721
No. of contigs                                         170,033        112,207        107,199        93,138         82,009
No. of gaps                                            116,442        109,362        105,264        92,078         81,288
No. of gaps ≤1 kbp                                     72,091         69,175         67,289         59,915         53,354
Average scaffold size (bp)                             54,217         966,219        1,395,602      2,348,450      3,118,848
Average contig size (bp)                               15,609         22,496         23,242         24,916         25,686
Average intrascaffold gap size (bp)                    2,161          2,054          1,985          1,832          1,749
Largest contig (bp)                                    1,988,321      1,988,321      1,988,321      1,988,321      1,988,321
% of total contigs                                     100            95             94             87             79

Whole-genome assembly
No. of bp in scaffolds (including intrascaffold gaps)  2,847,890,390  2,574,792,618  2,525,334,447  2,328,535,466  2,140,943,032
No. of bp in contigs                                   2,586,634,108  2,334,343,339  2,297,678,935  2,143,002,184  1,983,305,432
No. of scaffolds                                       118,968        2,507          1,637          818            554
No. of contigs                                         221,036        99,189         95,494         84,641         76,285
No. of gaps                                            102,068        96,682         93,857         83,823         75,731
No. of gaps ≤1 kbp                                     62,356         60,343         59,156         54,079         49,592
Average scaffold size (bp)                             23,938         1,027,041      1,542,660      2,846,620      3,864,518
Average contig size (bp)                               11,702         23,534         24,061         25,319         25,999
Average intrascaffold gap size (bp)                    2,560          2,487          2,426          2,213          2,082
Largest contig (bp)                                    1,224,073      1,224,073      1,224,073      1,224,073      1,224,073
% of total contigs                                     100            90             89             83             77

not covered by a matching segment in the other assembly. Some 82.5 Mbp of the WGA (3.95%) was not covered by the CSA, whereas 204.5 Mbp (8.26%) of the CSA was not covered by the WGA. This estimate did not require any consistency of the assemblies or any uniqueness of the matching segments. Thus, another analysis was conducted in which matches of less than 1 kbp between a pair of scaffolds were excluded unless they were confirmed by other matches having a consistent order and orientation. This gives some measure of consistent coverage: 1.982 Gbp (95.00%) of the WGA is covered by the CSA, and 2.169 Gbp (87.69%) of the CSA is covered by the WGA by this more stringent measure.

The comparison of WGA to CSA also permitted evaluation of scaffolds for structural inconsistencies. We looked for instances in which a large section of a scaffold from one assembly matched only one scaffold from the other assembly, but failed to match over the full length of the overlap implied by the matching segments. An initial set of candidates was identified automatically, and then each candidate was inspected by hand. From this process, we identified 31 instances in which the assemblies appear to disagree in a nonlocal fashion. These cases are being further evaluated to determine which assembly is in error and why.

In addition, we evaluated local inconsistencies of order or orientation. The following results exclude cases in which one contig in one assembly corresponds to more than one overlapping contig in the other assembly (as long as the order and orientation of the latter agrees with the positions they match in the former). Most of these small rearrangements involved segments on the order of hundreds of base pairs and rarely >1 kbp. We found a total of 295 kbp (0.012%) in the CSA assemblies that were locally inconsistent with the WGA assemblies, whereas 2.108 Mbp (0.11%) in the WGA assembly were inconsistent with the CSA assembly.
The CSA assembly was a few percentage points better in terms of coverage and slightly more consistent than the WGA, because it was in effect performing a few thousand shotgun assemblies of megabase-sized problems, whereas the WGA is performing a shotgun assembly of a gigabase-sized problem. When one considers the increase of two-and-a-half orders of magnitude in problem size, the information loss between the two is remarkably small. Because CSA was logistically easier to deliver and the better of the two results available at the time when downstream analyses needed to be begun, all subsequent analysis was performed on this assembly.

2.6 Mapping scaffolds to the genome 

The final step in assembling the genome was to order and orient the scaffolds on the chromosomes. We first grouped scaffolds together on the basis of their order in the components from CSA. These grouped scaffolds were reordered by examining residual mate-pairing data between the scaffolds. We next mapped the scaffold groups onto the chromosome using physical mapping data. This step depends on having reliable high-resolution map information such that each scaffold will overlap multiple markers. There are two genome-wide types of map information available: high-density STS maps and fingerprint maps of BAC clones developed at Washington University (45). Among the genome-wide STS maps, GeneMap99 (GM99) has the most markers and therefore was most useful for mapping scaffolds. The two different mapping approaches are complementary to one another. The fingerprint maps should have better local order because they were built by comparison of overlapping BAC clones. On the other hand, GM99 should have a more reliable long-range order, because the framework markers were derived from well-validated genetic maps. Both types of maps were used as a reference for human curation of the components that were the input to the regional assembly, but they did not determine the order of sequences produced by the assembler.



Fig. 5. Distribution of scaffold sizes of the CSA. For each range of scaffold sizes, the percent of total sequence is indicated. Scaffold size ranges on the horizontal axis: <30 kb, 30-50 kb, 50-100 kb, 100-500 kb, 0.5-1 Mb, 1-5 Mb, 5-10 Mb, and >10 Mb.



In order to determine the effectiveness of the fingerprint maps and GM99 for mapping scaffolds, we first examined the reliability of these maps by comparison with large scaffolds. Only 1% of the STS markers on the 10 largest scaffolds (those >9 Mbp) were mapped on a different chromosome on GM99. Two percent of the STS markers disagreed in position by more than five framework bins. However, for the fingerprint maps, a 2% chromosome discrepancy was observed, and on average 23.8% of BAC locations in the scaffold sequence disagreed with fingerprint map placement by more than five BACs. When further examining the source of discrepancy, it was found that most of the discrepancy came from 4 of the 10 scaffolds, indicating that there is variation in the quality of either the map or the scaffolds. All four scaffolds were assembled as well as the other six, as judged by clone coverage analysis, and showed the same low discrepancy rate to GM99, and thus we concluded that the fingerprint map global order in these cases was not reliable. Smaller scaffolds had a higher discordance rate with GM99 (4.21% of STSs were discordant by more than five framework bins), but a lower discordance rate with the fingerprint maps (11% of BACs disagreed with fingerprint maps by more than five BACs). This observation agrees with the clone coverage analysis (45) that Celera scaffold construction was better supported by long-range mate pairs in larger scaffolds than in small scaffolds.

We created two orderings of Celera scaffolds on the basis of the markers (BAC or STS) on these maps. Where the order of scaffolds agreed between GM99 and the WashU BAC map, we had a high degree of confidence that that order was correct; these scaffolds were termed "anchor scaffolds." Only scaffolds with a low overall discrepancy rate with both maps were considered anchor scaffolds. Scaffolds in GM99 bins were allowed to permute in their order to match WashU ordering, provided they did not violate their framework orders. Orientation of individual scaffolds was determined by the presence of multiple mapped markers with consistent order. Scaffolds with only one marker have insufficient information to assign orientation. We found 70.1% of the genome in anchored scaffolds, more than 99% of which are also oriented (Table 4). Because GM99 is of lower resolution than the WashU map, a number of scaffolds without STS matches could be ordered relative to the anchored scaffolds because they included sequence from the same or adjacent BACs on the WashU map. On the other hand, because of occasional WashU global ordering discrepancies, a number of scaffolds determined to be "unmappable" on the WashU map could be ordered relative to the anchored scaffolds
assembly against other finished sequence for determining sequencing accuracy at the nucleotide level, although this has been done for identifying polymorphisms as described in Section 6. The accuracy of the consensus sequence is at least 99.96% on the basis of a statistical estimate derived from the quality values of the underlying reads.
The structural consistency of the assembly can be measured by mate-pair analysis. In a correct assembly, every mated pair of sequencing reads should be located on the consensus sequence with the correct separation and orientation between the pairs. A pair is termed "valid" when the reads are in the correct orientation and the distance between them is within the mean ± 3 standard deviations of the distribution of insert sizes of the library from which the pair was sampled. A pair is termed "misoriented" when the reads are not correctly oriented, and is termed "misseparated" when the distance between the reads is not in the correct range but the reads are correctly oriented. The mean ± the standard deviation of each library used by the assembler was determined as described above. To validate these, we examined all reads mapped to the finished sequence of
chromosome 21 (^<?) and determined how 
many incorrect mate pairs there were as a 
result of laboratory tracking errors and chi- : i 

: merism (two . different segments of the ge- 1 
nbme cloned into the sarhe plasmid), and how 
tight the distribution of insert sizes was foi* 



those that were correct (Table 5). The standard deviations for all
Celera libraries were quite small, less than 15% of the insert
length, with the exception of a few 50-kbp libraries. The 2- and
10-kbp libraries contained less than 2% invalid mate pairs, whereas
the 50-kbp libraries were somewhat higher (~10%). Thus, although the
mate-pair information was not perfect, its accuracy was such that
measuring valid, misoriented, and misseparated pairs with respect to
a given assembly was deemed to be a reliable instrument for
validation purposes, especially when several mate pairs confirm or
deny an ordering.
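
The valid / misoriented / misseparated definitions above can be sketched in a few lines of Python. This is only an illustration, not the Celera implementation: the function name `classify_mate_pair`, the precomputed `correctly_oriented` flag, and the example numbers are assumptions; only the mean ± 3 SD rule is taken from the text.

```python
def classify_mate_pair(correctly_oriented, distance, lib_mean, lib_sd, k=3.0):
    """Classify one mate pair against its library's insert-size distribution.

    Hypothetical helper: orientation is assumed to be determined upstream,
    and `distance` is the separation of the two reads on the consensus (bp).
    """
    if not correctly_oriented:
        return "misoriented"
    low = lib_mean - k * lib_sd
    high = lib_mean + k * lib_sd
    if low <= distance <= high:
        return "valid"
    # Correct orientation, but outside mean +/- k standard deviations.
    return "misseparated"
```

For a library with a mean insert of 2,081 bp and an SD of 106 bp, for example, the accepted window under this rule would be 1,763 to 2,399 bp.
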
The clone coverage of the genome was 39X, meaning that any given
base pair was, on average, contained in 39 clones or, equivalently,
spanned by 39 mate-paired reads. Areas of low clone coverage or
areas with a high proportion of invalid mate pairs would indicate
potential assembly problems. We computed the coverage of each base
in the assembly by valid mate pairs (Table 6). In summary, for
scaffolds >30 kbp in length, less than 1% of the Celera assembly was
in regions of less than 3X clone coverage. Thus, more than 99% of
the assembly, including order and orientation, is strongly supported
by this measure alone.
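
A per-base clone-coverage calculation of the kind described above can be sketched with a difference array. This is a minimal illustration under stated assumptions, not the method actually used: the function names, the half-open 0-based coordinates, and the toy spans are all hypothetical.

```python
from itertools import accumulate

def clone_coverage(length, spans):
    """Per-base coverage from valid mate-pair spans.

    `spans` holds (start, end) pairs, 0-based and half-open, each assumed
    to lie within [0, length].
    """
    diff = [0] * (length + 1)
    for start, end in spans:
        diff[start] += 1   # coverage rises where a clone begins
        diff[end] -= 1     # and falls just past where it ends
    return list(accumulate(diff[:-1]))

def fraction_below(coverage, threshold=3):
    """Fraction of bases whose coverage is below `threshold`."""
    return sum(1 for c in coverage if c < threshold) / len(coverage)
```

Applied to a scaffold, `fraction_below(coverage, 3)` gives the share of bases in regions of less than 3X clone coverage, the quantity the text reports as under 1% for scaffolds >30 kbp.
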
We examined the locations and number of all misoriented and
misseparated mates. In addition to doing this analysis on the CSA
assembly (as of 1 October 2000), we also performed a study of the
PFP assembly as of September 2000 (50, J56). In this latter case,
Celera mate pairs had to be mapped to the PFP assembly. To avoid
mapping errors due to high-fidelity repeats, the only pairs mapped
were those for which both reads matched at only one location with
less than 6% differences. A threshold was set such that sets of five
or more simultaneously invalid mate pairs indicated a potential
breakpoint, where the construction of the two assemblies differed.
The graphic comparison of the CSA chromosome 21 assembly with the
published sequence (Fig. 6A) serves as a validation of this
methodology. Blue tick marks in the panels indicate breakpoints.
There were a similar (small) number of breakpoints on both
chromosome sequences. The exception was 12 sets of scaffolds in the
Celera assembly (a total of 3% of the chromosome length in 212
single-contig scaffolds) that were mapped to the wrong positions
because they were too small to be mapped reliably. Figures 6 and 7
and Table 6 illustrate the mate-pair differences and breakpoints
between the two assemblies. There was a higher percentage of
misoriented and misseparated mate pairs in the large-insert
libraries (50 kbp and BAC ends) than in the small-insert libraries
in both assemblies (Table 6). The large-insert libraries are more
likely to identify discrepancies simply because they span a larger
segment of the genome. The graphic comparison between the two
assemblies for chromosome 8 (Fig. 6, B and C) shows that there are
many



Table 5. Mate-pair validation. Celera fragment sequences were mapped to
the published sequence of chromosome 21. Each mate pair uniquely
mapped was evaluated for correct orientation and placement (number
of mate pairs tested). If the two mates had incorrect relative
orientation or placement they were considered invalid (number of
invalid mate pairs).

                   Chromosome 21                                                         Genome
Library  Library   Mean insert  SD      SD/mean  No. of mate   No. of invalid  %        Mean insert  SD      SD/mean
type     no.       size (bp)    (bp)    (%)      pairs tested  mate pairs      invalid  size (bp)    (bp)    (%)
2 kbp    1         2,081        106     5.1      3,642         38              1.0      2,082        90      4.3
2 kbp    2         1,913        152     7.9      28,029        413             1.5      1,923        118     6.1
2 kbp    3         2,166        175     8.1      4,405         57              1.3      2,162        158     7.3
10 kbp   4         11,385       851     7.5      4,319         80              1.9      11,370       696     6.1
10 kbp   5         14,523       1,875   12.9     7,355         156             2.1      14,142       1,402   9.9
10 kbp   6         9,635        1,035   10.7     5,573         109             2.0      9,606        934     9.7
10 kbp   7         10,223       928     9.1      34,079        399             1.2      10,190       777     7.6
50 kbp   8         64,888       2,747   4.2      16            1               6.3      65,500       5,504   8.4
50 kbp   9         53,410       5,834   10.9     914           170             18.6     53,311       5,546   10.4
50 kbp   10        52,034       7,312   14.1     5,871         569                      51,498       6,588   12.8
50 kbp   11        52,282       7,454   14.3     2,629         213             8.1      52,282       7,454   14.3
50 kbp   12        46,616       7,378   15.8     2,153         215             10.0     45,418       9,068   20.0
50 kbp   13        55,788       10,099  18.1     2,244         249             11.1     53,062       10,893  20.5
50 kbp   14        39,894       5,019   12.6     199           7               3.5      36,838       9,988   27.1
50 kbp   15        48,931       9,813   20.1     144           10              6.9      47,845       4,774   10.0
50 kbp   16        48,130       4,232   8.8      195           14              7.2      47,924       4,581   9.6
BES      17        106,027      27,778  26.2     330           16              4.8      152,000      26,600  17.5
BES      18        160,575      54,973  34.2     155           8               5.2      161,750      27,000  16.7
BES      19        164,155      19,453  11.9     642           44              6.9      176,500      19,500  11.05
Sum                                              102,894       2,768           2.7 (mean = 2.7)
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gene boundaries. During this process, multiple
hits to the same region were collapsed to a coherent set of data by
tracking the coverage of a region. For example, if a group of bases
was represented by multiple overlapping ESTs, the union of these
regions matched by the set of ESTs on the scaffold was marked as
being supported by EST evidence. This resulted in a series of "gene
bins," each of which was believed to contain a single gene. One
weakness of this initial implementation of the algorithm was in
predicting gene boundaries in regions of tandemly duplicated genes.
Gene clusters frequently resulted in homologous neighboring genes
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being joined together, resulting in an annotation
that artificially concatenated these gene models.
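
Collapsing overlapping hits into the union of the regions they cover, as described above for EST evidence, amounts to interval merging. The following is a minimal sketch of that idea; the function name and the (start, end) tuple representation are assumptions, and the original pipeline may have tracked coverage differently.

```python
def merge_intervals(hits):
    """Return the union of (start, end) hits as sorted, non-overlapping intervals.

    Hits that overlap or touch are merged into one supported region.
    """
    merged = []
    for start, end in sorted(hits):
        if merged and start <= merged[-1][1]:
            # Overlaps (or abuts) the previous region: extend it.
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [tuple(region) for region in merged]
```

The same merge also shows the weakness the text mentions: hits to neighboring, tandemly duplicated genes that overlap in scaffold coordinates end up in a single merged region.
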
Next, known genes (those with exact matches of a full-length cDNA
sequence to the genome) were identified, and the region
corresponding to the cDNA was annotated as a predicted transcript. A
subset of the curated human gene set RefSeq from the National Center
for Biotechnology Information (NCBI) was included as a data set
searched in the computational pipeline. If a RefSeq transcript
matched the genome assembly for at least 50% of its length at >92%
identity, then the SIM4 (63) alignment of the RefSeq transcript to
the region of the genome under analysis was promoted to the status
of an Otto annotation. Because the genome sequence has gaps and
sequence errors such as frameshifts, it was not always possible to
predict a transcript that agrees precisely with the experimentally
determined cDNA sequence. A total of 6538 genes in our inventory
were identified and transcripts predicted in this way.

Regions that have a substantial amount of sequence similarity, but
do not match known genes, were analyzed by that part of the Otto
system that uses the sequence similarity information to predict a
transcript. Here, Otto
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Fig. 6. Comparison of the CSA and the PFP assembly.
(A) All of chromosome 21, (B) all of chromosome 8, and (C) a 1-Mb
region of chromosome 8 representing a single Celera scaffold. To
generate the figure, Celera fragment sequences were mapped onto each
assembly. The PFP assembly is indicated in the upper third of each
panel; the Celera assembly is indicated in the lower third. In the
center of the panel, green lines show Celera sequences that are in
the same order and orientation in both assemblies and form the
longest consistently ordered run of sequences. Yellow lines indicate
sequence blocks that are in the same orientation, but out of order.
Red lines indicate sequence blocks that are not in the same
orientation. For clarity, in the latter two cases, lines are only
drawn between segments of matching sequence that are at least 50 kbp
long. The top and bottom thirds of each panel show the extent of
Celera mate-pair violations (red, misoriented; yellow, incorrect
distance between the mates) for each assembly grouped by library
size. (Mate pairs that are within the correct distance, as expected
from the mean library insert size, are omitted from the figure for
clarity.) Predicted breakpoints, corresponding to stacks of violated
mate pairs of the same type, are shown as blue ticks on each
assembly axis. Runs of more than 10,000 Ns are shown as cyan bars.
Plots of all 24 chromosomes can be seen in Web fig. 3 on Science
Online at www.sciencemag.org/cgi/content/full/291/5507/1304/DC1.




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bases flanking these regions). The other bases
in the region, those not covered by any homology evidence, were
replaced by N's. This sequence segment, with high confidence regions
represented by the consensus genomic sequence and the remainder
represented by N's, was then evaluated by Genscan to see if a
consistent gene model could be generated. This procedure simplified
the gene-prediction task by first establishing the boundary for the
gene (not a strength of most gene-finding algorithms), and by
eliminating regions with no supporting evidence. If Genscan returned
a plausible gene model, it was further evaluated before being
promoted to an "Otto" annotation. The final Genscan predictions were
often quite different from the prediction that Genscan returned on
the same region of native genomic sequence. A weakness of using
Genscan to refine the gene model is the loss of valid, small exons
from the final annotation.
The next step in defining gene structures based on sequence
similarity was to compare each predicted transcript with the
homology-based evidence that was used in previous steps to evaluate
the depth of evidence for each exon in the prediction. Internal
exons were considered to be supported if they were covered by
homology evidence to within ±10 bases of their edges. For first and
last exons, the internal edge was required to be within 10 bases,
but the external edge was allowed greater latitude to allow for 5'
and 3' untranslated regions (UTRs). To be retained, a prediction for
a multi-exon gene must have evidence such that the total number of
"hits," as defined above, divided by the number of exons in the
prediction must be >0.66 or must correspond to a RefSeq sequence. A
single-exon gene must be covered by at least three supporting hits
(±10 bases on each side), and these must cover the complete
predicted open reading frame. For a single-exon gene, we also
required that the Genscan prediction include both a start and a stop
codon. Gene models that did not meet these criteria were
disregarded, and



Table 7. Sensitivity and specificity of Otto and Genscan.
Sensitivity and specificity were calculated by first aligning the
prediction to the published RefSeq transcript, tallying the number
(N) of uniquely aligned RefSeq bases. Sensitivity is the ratio of N
to the length of the published RefSeq transcript. Specificity is the
ratio of N to the length of the prediction. All differences are
significant (Tukey HSD; P < 0.001).

Method                 Sensitivity   Specificity
Otto (RefSeq only)*    0.939         0.973
Otto (homology)†       0.604         0.884
Genscan                0.501         0.633

*Refers to those annotations produced by Otto using only the
Sim4-polished RefSeq alignment rather than an evidence-based Genscan
prediction. †Refers to those annotations produced by supplying all
available evidence to Genscan.
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those that passed were promoted to Otto
predictions. Homology-based Otto predictions do not contain 3' and
5' untranslated sequence. Although three de novo gene-finding
programs [GRAIL, Genscan, and FgenesH (63)] were run as part of the
computational analysis, the results of these programs were not
directly used in making the Otto predictions. Otto predicted 11,226
additional genes by means of sequence similarity.
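
The retention criteria for homology-based gene models (more than 0.66 supported hits per exon or a RefSeq match for multi-exon genes; at least three hits covering the whole open reading frame plus start and stop codons for single-exon genes) can be written as a small predicate. This is a hedged sketch: the function and parameter names are hypothetical, and the ±10-base edge checks are assumed to have been applied already when counting hits.

```python
def retain_prediction(n_exons, n_supported_hits, matches_refseq=False,
                      orf_fully_covered=False, has_start_and_stop=False):
    """Decide whether a predicted gene model passes the evidence criteria.

    `n_supported_hits` is assumed to count only hits that already satisfy
    the edge-tolerance rules described in the text.
    """
    if n_exons < 1:
        raise ValueError("a prediction needs at least one exon")
    if n_exons == 1:
        # Single-exon gene: three or more hits, full ORF coverage,
        # and a Genscan model with both start and stop codons.
        return n_supported_hits >= 3 and orf_fully_covered and has_start_and_stop
    # Multi-exon gene: hits per exon above 0.66, or a RefSeq correspondence.
    return matches_refseq or (n_supported_hits / n_exons) > 0.66
```

Under this reading, a three-exon model with two supported exons (2/3 ≈ 0.667) is retained, whereas one with a single supported exon is not unless it corresponds to a RefSeq sequence.
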

3.2 Otto validation

To validate the Otto homology-based process and the method that
Otto uses to define the structures of known genes, we compared
transcripts predicted by Otto with their corresponding (and
presumably correct) transcript from a set of 4512 RefSeq transcripts
for which there was a unique SIM4 alignment (Table 7). In order to
evaluate the relative performance of Otto and Genscan, we made three
comparisons. The first involved a determination of the accuracy of
gene models predicted by Otto with only homology data other than the
corresponding RefSeq sequence (Otto homology in Table 7). We
measured the sensitivity (correctly predicted bases divided by the
total length of the cDNA) and specificity (correctly predicted bases
divided by the sum of the correctly and incorrectly predicted
bases). Second, we examined the sensitivity and specificity of the
Otto predictions that were made solely with the RefSeq sequence,
which is the process that Otto uses to annotate known genes
(Otto-RefSeq). And third, we determined the accuracy of the Genscan
predictions corresponding to these RefSeq sequences. As expected,
the alignment method (Otto-RefSeq) was the most accurate, and
Otto-homology performed better than Genscan by both criteria. Thus,
6.1% of true RefSeq nucleotides were not represented in the
Otto-RefSeq annotations and 2.7% of the nucleotides in the
Otto-RefSeq transcripts were not contained in the original RefSeq
transcripts. The discrepancies could come from legitimate
differences between the Celera assembly and the RefSeq transcript
due to polymorphisms, incomplete or incorrect data in the Celera
assembly, errors introduced by Sim4 during the alignment process, or
the presence of alternatively spliced forms in the data set used for
the comparisons.
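
The two accuracy measures defined above reduce to simple ratios of aligned bases. A minimal sketch follows, with a hypothetical function name and made-up example lengths; the definitions themselves are the ones given in the text.

```python
def sensitivity_specificity(n_aligned, refseq_length, prediction_length):
    """Return (sensitivity, specificity) for one predicted transcript.

    n_aligned: number of correctly predicted (uniquely aligned) bases.
    Sensitivity = n_aligned / length of the RefSeq transcript.
    Specificity = n_aligned / length of the prediction.
    """
    if refseq_length <= 0 or prediction_length <= 0:
        raise ValueError("lengths must be positive")
    return n_aligned / refseq_length, n_aligned / prediction_length

# Illustrative numbers only: 900 aligned bases, a 1000-base RefSeq
# transcript, and a 1200-base prediction.
sens, spec = sensitivity_specificity(900, 1000, 1200)
```

Read this way, the 6.1% and 2.7% figures quoted above are the complements of the Otto-RefSeq sensitivity (0.939) and specificity (0.973).
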

Because Otto uses an evidence-based approach to reconstruct genes,
the absence of experimental evidence for intervening exons may
inadvertently result in a set of exons that cannot be spliced
together to give rise to a transcript. In such cases, Otto may
"split genes" when in fact all the evidence should be combined into
a single transcript. We also examined the tendency of these methods
to incorrectly split gene predictions. These trends are shown in
Fig. 8. Both RefSeq and homology-based predictions by Otto split
known genes into fewer segments than Genscan alone.



3.3 Gene number 

Recognizing that the Otto system is quite conservative, we used a
different gene-prediction strategy in regions where the homology
evidence was less strong. Here the results of de novo gene
predictions were used. For these genes, we insisted that a predicted
transcript have at least two of the following types of evidence to
be included in the gene set for further analysis: protein, human
EST, rodent EST, or mouse genome fragment matches. This final class
of predicted genes is a subset of the predictions made by the three
gene-finding programs that were used in the computational pipeline.
For these, there was not sufficient sequence similarity information
for Otto to attempt to predict a gene structure. The three de novo
gene-finding programs resulted in about 155,695 predictions, of
which ~76,410 were nonredundant (nonoverlapping with one another).
Of these, 57,935 did not overlap known genes or predictions made by
Otto. Only 21,350 of the gene predictions that did not overlap Otto
predictions were partially supported by at least one type of
sequence similarity evidence, and 8619 were partially supported by
two types of evidence (Table 8).
The sum of this number (21,350) and the number of Otto annotations
(17,764), 39,114, is near the upper limit for the human gene
complement. As seen in Table 8, if the requirement for other
supporting evidence is made more stringent, this number drops
rapidly so that demanding two types of evidence reduces the total
gene number to 26,383 and demanding three types reduces it to
~23,000. Requiring that a prediction be supported by all four
categories of evidence is too stringent because it would eliminate
genes that encode novel proteins (members of currently undescribed
protein families). No correction for pseudogenes has been made at
this point in the analysis.
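
The gene totals quoted above follow from adding the Otto annotations to the de novo predictions that survive each evidence threshold. The short check below only restates that arithmetic; the variable names are illustrative, and all counts are the ones given in the text.

```python
# Counts taken from the text.
KNOWN_GENES = 6_538             # transcripts predicted from RefSeq/cDNA matches
HOMOLOGY_GENES = 11_226         # additional Otto genes from sequence similarity
DE_NOVO_ONE_EVIDENCE = 21_350   # de novo predictions with at least one evidence type
DE_NOVO_TWO_EVIDENCE = 8_619    # de novo predictions with two evidence types

otto_annotations = KNOWN_GENES + HOMOLOGY_GENES                  # 17,764
upper_estimate = otto_annotations + DE_NOVO_ONE_EVIDENCE         # 39,114
two_evidence_estimate = otto_annotations + DE_NOVO_TWO_EVIDENCE  # 26,383
```
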

In a further attempt to identify genes that were not found by the
autoannotation process or any of the de novo gene finders, we
examined regions outside of gene predictions that were similar to
the EST sequence, and where the EST matched the genomic sequence
across a splice junction. After correcting for potential 3' UTRs of
predicted genes, about 2500 such regions remained. Addition of a
requirement for at least one of the following evidence types
(homology to mouse genomic sequence fragments, rodent ESTs, or
cDNAs) or similarity to a known protein reduced this number to 1010.
Adding this to the numbers from the previous paragraph would give us
estimates of about 40,000, 27,000, and 24,000 potential genes in the
human genome, depending on the stringency of evidence considered.
Table 8 illustrates the number of genes and presents the degree of
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Examination of pericentromeric regions is
ongoing.

The remaining ~80% of the genome, the euchromatic component, is
divisible into G-, R-, and T-bands (67). These cytogenetic bands
have been presumed to differ in their nucleotide composition and
gene density, although we have been unable to determine precise band
boundaries at the molecular level. T-bands are the most G+C- and
gene-rich, and G-bands are G+C-poor (55). Bernardi has also offered
a description of the euchromatin at the molecular level as long
stretches of DNA of differing base composition, termed isochores
(denoted L, H1, H2, and H3), which are >300 kbp in length (69).
Bernardi defined the L (light) isochores as G+C-poor (<43%), whereas
the H (heavy) isochores fall into three G+C-rich classes
representing 24, 8, and 5% of the genome. Gene concentration has
been claimed to be very low in the L isochores and 20-fold more
enriched in the H2 and H3 isochores (70). By examining contiguous
50-kbp windows of G+C content across the assembly, we found that
regions of G+C content >48% (H3 isochores) averaged 273.9 kbp in
length, those with G+C content between 43 and 48% (H1+H2 isochores)
averaged 202.8 kbp in length, and the average span of regions with
<43% (L isochores) was 1078.6 kbp. The correlation between G+C
content and gene density was also examined in 50-kbp windows along
the assembled sequence (Table 9 and Figs. 10 and 11). We found that
the density of genes was greater in regions of high G+C than in
regions of low G+C content, as expected. However, the correlation
between G+C content and gene density was not as skewed as previously
predicted (69). A higher proportion of genes were located in the
G+C-poor regions than had been expected.
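
A windowed G+C scan of the kind described above can be sketched as follows. This is an illustration only: the function names are hypothetical, ambiguous bases (such as N) are simply ignored, and the handling of windows falling exactly on the 43% or 48% boundaries is an assumption, since the text does not specify it.

```python
def gc_fraction(seq):
    """G+C fraction among unambiguous bases; None if the window has none."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc / (gc + at) if gc + at else None

def isochore_class(gc):
    """Map a G+C fraction to the classes used in the text."""
    if gc is None:
        return "N"          # window without any called bases
    pct = 100.0 * gc
    if pct > 48:
        return "H3"
    if pct >= 43:
        return "H1+H2"
    return "L"

def classify_windows(seq, window=50_000):
    """Classify contiguous, non-overlapping windows along a sequence."""
    return [isochore_class(gc_fraction(seq[i:i + window]))
            for i in range(0, len(seq), window)]
```

Runs of consecutive windows with the same class can then be measured to obtain average region lengths comparable to those reported above.
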

Chromosomes 17, 19, and 22, which have a disproportionate number of
H3-containing bands, had the highest gene density (Table 10).
Conversely, of the chromosomes that we
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found to have the lowest gene density, X, 4,
18, 13, and Y, also have the fewest H3 bands. Chromosome 15, which
also has few H3 bands, did not have a particularly low gene density
in our analysis. In addition, chromosome 8, which we found to have a
low gene density, does not appear to be unusual in its H3 banding.

How [...] mammalian genomes consist of oases of genes in otherwise
essen[...] appears that the human genome does indeed contain
deserts, or large, gene-poor regions. If we define a desert as a
region >500 kbp without a gene, then we see that 605 Mbp, or about
20% of the genome, is in deserts. These are not uniformly
distributed over the various chromosomes. Gene-rich chromosomes 17,
19, and 22 have only about 12% of their collective 171 Mbp in
deserts, whereas gene-poor chromosomes 4, 13, 18, and X have 27.5%
of their 492 Mbp in deserts (Table 11). The apparent lack of
predicted genes in these regions does not necessarily imply that
they are devoid of biological function.
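
The desert definition used above (a region >500 kbp without a gene) can be illustrated by scanning the gaps between annotated genes. The sketch below is an assumption-laden example rather than the analysis actually performed: the function name, the half-open coordinates, and the toy gene positions are all hypothetical.

```python
def find_deserts(chrom_length, genes, min_size=500_000):
    """Return gene-free gaps longer than `min_size` as (start, end) pairs.

    `genes` holds (start, end) intervals, 0-based and half-open; genes may
    overlap one another.
    """
    deserts = []
    cursor = 0  # end of the right-most gene seen so far
    for start, end in sorted(genes):
        if start - cursor > min_size:
            deserts.append((cursor, start))
        cursor = max(cursor, end)
    if chrom_length - cursor > min_size:
        deserts.append((cursor, chrom_length))
    return deserts

# Toy 3-Mbp chromosome with three genes (illustrative coordinates).
toy_genes = [(100_000, 150_000), (900_000, 950_000), (1_000_000, 1_200_000)]
toy_deserts = find_deserts(3_000_000, toy_genes)
desert_fraction = sum(e - s for s, e in toy_deserts) / 3_000_000
```

Summing desert lengths per chromosome and dividing by chromosome length yields per-chromosome percentages of the kind compared in the text.
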

4.2 Linkage map 

Linkage maps provide the basis for genetic analysis and are widely
used in the study of the inheritance of traits and in the positional
cloning of genes. The distance metric, centimorgans (cM), is based
on the recombination rate between homologous chromosomes during
meiosis. In general, the rate of recombination in females is greater
than that in males, and this degree of map expansion is not uniform
across the genome (72). One of the opportunities enabled by a nearly
complete genome sequence is to produce the ultimate physical map,
and to fully analyze its correspondence with two other maps that
have been widely used in genome and genetic analysis: the linkage
map and the cytogenetic map. This would close the loop between the
mapping and sequencing phases of the genome project.

We mapped the location of the markers that constitute the Genethon
linkage map to the genome. The rate of recombination, expressed as
cM per Mbp, was calculated for 3-Mbp windows as shown in Table 12.
Higher rates of recombination in the telomeric region of the
chromosomes have been previously documented (75). From this mapping
result, there is a difference of 4.99 between lowest rates and
highest rates and the largest difference of 4.4 between males and
females (4.99 to 0.47 on chromosome 16). This indicates that the
variability in recombination rates among regions of the genome
exceeds the differences in recombination rates between males and
females. The human genome has recombination hotspots, where
recombination rates vary fivefold or more over a space of 1 kbp, so
the picture one gets of the magnitude of variability in
recombination rate will depend on the size of the window



Table 9. Characteristics of G+C in isochores.

                      Fraction of genome        Fraction of genes
Isochore   G+C (%)    Predicted*   Observed     Predicted*   Observed
H3         >48        5            9.5          37           24.8
H1/H2      43-48      25           21.2         32           25.6
L          <43        67           69.2         31           48.5

*The predictions were based on Bernardi's definitions (70) of the
isochore structure of the human genome.



Fig. 9. Comparison of the number of exons per transcript between the
17,968 Otto transcripts and 21,350 de novo transcript predictions
with at least one line of evidence that do not overlap with an Otto
prediction. Both sets have the highest number of transcripts in the
two-exon category, but the de novo gene predictions are skewed much
more toward smaller transcripts. In the Otto set, 19.7% of the
transcripts have one or two exons, and 5.7% have more than 20. In
the de novo set, 49.3% of the transcripts have one or two exons, and
0.2% have more than 20.

[Figure: number of exons per transcript (x axis) for two series: No.
of Otto transcripts and No. of de novo + 1 line of evidence.]
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that account for gene inactivation. The general
structural characteristics of these processed pseudogenes include
the complete lack of intervening sequences found in the functional
counterparts, a poly(A) tract at the 3' end, and direct repeats
flanking the pseudogene sequence. Processed pseudogenes occur as a
result of retrotransposition, whereas unprocessed pseudogenes arise
from segmental genome duplication.

We searched the complete set of Otto-predicted transcripts against
the genomic sequence by means of BLAST. Genomic regions
corresponding to all Otto-predicted transcripts were excluded from
this analysis. We identified 2909 regions matching with greater than
70% identity over at least 70% of the length of the transcripts that
likely represent processed pseudogenes. This number is probably an
underestimate because specific methods to search for pseudogenes
were not used.
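
The identity and length thresholds quoted above can be expressed as a simple filter over alignment hits. This sketch does not use or imitate any BLAST API: the `Hit` record and its field names are hypothetical, and it assumes the aligned length has already been projected onto the transcript.

```python
from dataclasses import dataclass

@dataclass
class Hit:
    """One transcript-to-genome match (illustrative fields only)."""
    transcript_id: str
    transcript_length: int    # bp
    aligned_length: int       # bp of the transcript covered by the match
    percent_identity: float   # 0-100

def is_candidate_pseudogene(hit, min_identity=70.0, min_coverage=0.70):
    """Greater than 70% identity over at least 70% of the transcript length."""
    coverage = hit.aligned_length / hit.transcript_length
    return hit.percent_identity > min_identity and coverage >= min_coverage
```

Hits that overlap the genomic regions of the Otto-predicted transcripts themselves would need to be excluded before applying such a filter, as the text notes.
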

We looked for correlations between structural elements and the
propensity for retrotransposition in the human genome. GC content
and transcript length were compared between the genes with processed
pseudogenes (1177 source genes) versus the remainder of the
predicted gene set. Transcripts that give rise to processed
pseudogenes have shorter average transcript length (1027 bp versus
1594 bp for the Otto set), as compared with genes for which no
pseudogene was detected. The overall GC content did not show any
significant difference, contrary to a recent report (88). There is a
clear trend in gene families that are present as processed
pseudogenes. These include ribosomal proteins (67%), lamin receptors
(10%), translation elongation factor alpha (5%), and HMG-non-histone
proteins (2%). The increased occurrence of retrotransposition (both
intronless paralogs and processed pseudogenes) among genes involved
in translation and nuclear regulation may reflect an increased
transcriptional activity of these genes.

5.3 Gene duplication in the human genome

Building on a previously published procedure (27), we developed a
graph-theoretic algorithm, called Lek, for grouping the predicted
human protein set into protein families (89).



Table 13. Characteristics of CpG islands identified in chromosome 22
(34-Mbp sequence length) and the whole genome (2.9-Gbp sequence
length) by means of two different methods. Method 1 uses a CG
likelihood ratio of ≥0.6. Method 2 uses a CG likelihood ratio of
≥0.8.

                                                                Chromosome 22          Whole genome (CS assembly)
                                                                Method 1   Method 2    Method 1   Method 2
Number of CpG islands detected                                  5,211      522         195,706    26,876
Average length of island (bp)                                   390        535         395        497
Percent of sequence predicted as CpG                            5.9        0.8         2.6        0.4
Percent of first exons that overlap a CpG island                44         25          42         22
Percent of first exons with first position of exon
  contained inside a CpG island                                 37         22          40         21
Average distance between first exon and closest CpG
  island (bp)                                                   1,013      10,486      2,182      17,021
Expected distance between first exon and closest CpG
  island (bp)                                                   3,262      32,567      7,164      55,811

Table 14. Distribution of repetitive DNA in the compartmentalized
shotgun assembly sequence.

                                             Megabases in   Percent    Previously
                                             assembled      of         predicted
Repetitive elements                          sequences      assembly   (%) (83)
Alu                                          288            9.9        10.0
Mammalian interspersed repeat (MIR)          66             2.3        1.7
Medium reiteration (MER)                     50             1.7        1.6
Long terminal repeat (LTR)                   155            5.3        5.6
Long interspersed nucleotide element (LINE)  466            16.1       16.7
Total                                        1025           35.3       35.6



The complete clusters that result from the Lek clustering provide
one basis for comparing the role of whole-genome or chromosomal
duplication in protein family expansion as opposed to other means,
such as tandem duplication. Because each complete cluster represents
a closed and certain island of homology, and because Lek is capable
of simultaneously clustering protein complements of several
organisms, the number of proteins contributed by each organism to a
complete cluster can be predicted with confidence depending on the
quality of the annotation of each genome. The variance of each
organism's contribution to each cluster can then be calculated,
allowing an assessment of the relative importance of large-scale
duplication versus smaller-scale, organism-specific expansion and
contraction of protein families, presumably as a result of natural
selection operating on individual protein families within an
organism. As can be seen in Fig. 12, the large variance in the
relative numbers of human as compared with D. melanogaster and
Caenorhabditis elegans proteins in complete clusters may be
explained by multiple events of relative expansions in gene families
in each of the three animal genomes. Such expansions would give rise
to the distribution that shows a peak at 1:1 in the ratio for
human-worm or human-fly clusters with the slope spread covering both
human and fly/worm predominance, as we observed (Fig. 12).
Furthermore, there are nearly as many clusters where worm and fly
proteins predominate despite the larger numbers of proteins in the
human. At face value, this analysis suggests that natural selection
acting on individual protein families has been a major force driving
the expansion of at least some elements of the human protein set.
However, in our analysis, the difference between an ancient
whole-genome duplication followed by loss, versus piecemeal
duplication, cannot be easily distinguished. In order to
differentiate these scenarios, more extended analyses were
performed.

5.4 Large-scale duplications 

Using two independent methods, we searched for large-scale
duplications in the human genome. First, we describe a protein
family-based method that identified highly conserved blocks of
duplication. We then describe our comprehensive method for
identifying all interchromosomal block duplications. The latter
method identified a large number of duplicated chromosomal segments
covering parts of all 24 chromosomes.

The first of the methods is based on the 
idea of searching for blocks of highly con- 
served homologous proteins that occur in 
more than one location on the genome. For 
this comparison, two genes were considered 
equivalent if their protein products were de- 
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By this measure, the duplication segment spans nearly half of each
chromosome's net length. The most likely scenario is that the whole
span of this region was duplicated as a single very large block,
followed by shuffling owing to smaller scale rearrangements. As
such, at least four subsequent rearrangements would need to be
invoked to explain the relative insertions and inversions seen in
the duplicated segment interval. The 64 protein pairs in this
alignment occur among 217 protein assignments on chromosome 18, and
among 322 protein assignments on chromosome 20, for a density of
involved proteins of 20 to 30%. This is consistent with an ancient
large-scale duplication followed by subsequent gene loss on one or
both chromosomes. Loss of just one member of a gene pair subsequent
to the duplication would result in a failure to score a gene pair in
the block; less than 50% gene loss on the chromosomes would lead to
the duplication density observed here. As an independent
verification of the significance of the alignments detected, it can
be seen that a substantial number of the pairs of aligning proteins
in this duplication, including some of those annotated (Fig. 13),
are those populating small Lek complete clusters (see above). This
indicates that they are members of very small families of paralogs;
their relative scarcity within the genome validates the uniqueness
and robust nature of their alignments.
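The 20 to 30% density quoted above can be checked directly from the counts in the text. The following is a minimal sketch in Python; the function names are illustrative, and the independent-loss model in `expected_pair_fraction` is a simplifying assumption introduced here, not the analysis used by the authors.

```python
PAIRS = 64          # aligned protein pairs in the chromosome 18/20 block
CHR18_TOTAL = 217   # protein assignments in the chromosome 18 span
CHR20_TOTAL = 322   # protein assignments in the chromosome 20 span


def pair_density(pairs: int, total: int) -> float:
    """Fraction of a span's protein assignments that belong to a scored pair."""
    return pairs / total


def expected_pair_fraction(loss: float) -> float:
    """Fraction of ancestral pairs still scorable after gene loss.

    Hypothetical model: each copy is lost independently with probability
    `loss`, and a pair is scored only if both copies survive.
    """
    retained = 1.0 - loss
    return retained * retained


d18 = pair_density(PAIRS, CHR18_TOTAL)
d20 = pair_density(PAIRS, CHR20_TOTAL)
print(f"chromosome 18 density: {d18:.1%}")
print(f"chromosome 20 density: {d20:.1%}")
print(f"pairs surviving 50% independent loss: {expected_pair_fraction(0.5):.0%}")
```

Under this toy model, 50% loss per copy leaves one quarter of the original pairs intact, which falls inside the observed 20 to 30% range.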

Two additional qualitative features were observed among many of the
large-scale duplications. First, several proteins with disease
associations, with OMIM (Online Mendelian Inheritance in Man)
assignments, are members of duplicated segments (see web table 2 on
Science Online at
www.sciencemag.org/cgi/content/full/291/5507/1304/DC1). We have also
observed a few instances where paralogs on 
both duplicated segments are associated with 
similar disease conditions. Notable among 
these genes are proteins involved in hemostasis 
(coagulation factors) that are associated with 
bleeding disorders, transcriptional regulators 
like the homeobox proteins associated with de- 
velopmental disorders, and potassium channels 
associated with cardiovascular conduction ab- 
normalities. For each of these disease genes, 
closer study of the paralogous genes in the 
duplicated segment may reveal new insights 
into disease causation, with further investigation
needed to determine whether they might be
involved in the same or similar genetic diseases. 
Second, although there is a conserved number 
of proteins and coding exons predicted for spe- 
cific large duplicated spans within the chromo- 
some 18 to 20 alignment, the genomic DNA of 
chromosome 18 in these specific spans is in 
some cases more than 10-fold longer than the 
corresponding chromosome 20 DNA. This selective accretion of
noncoding DNA (or conversely, loss of noncoding DNA) on one of a
pair of duplicated chromosome regions was observed in many compared
regions. Hypotheses to explain which mechanisms foster these
processes must be tested.

Evaluation of the alignment results gives some perspective on dating
of the duplications. As noted above, large-scale ancient segmental
duplication in fact best explains many of the blocks detected by
this analysis. The regions of human chromosomes involved in the
large-scale duplications expanded upon above (chromosomes 2 to 14, 2
to 12, and 18 to 20) are each syntenic to a distinct mouse
chromosomal region. The corresponding mouse chromosomal regions are
much more similar in sequence conservation, and even in order, to
their human synteny partners than the human duplication regions are
to each other. Further, the corresponding mouse chromosomal regions
each bear a significant proportion of genes orthologous to the human
genes on which the human duplication assignments were made. On the
basis of these factors, the corresponding mouse chromosomal spans,
at coarse resolution, appear to be products of the same large-scale
duplications observed in humans. Although further detailed analysis
must be carried out once a more complete genome is assembled for
mouse, the underlying large duplications appear to predate the two
species' divergence. This dates the duplications, at the latest,
before divergence of the primate and rodent lineages. This date can
be further refined upon examination of the synteny between human
chromosomes and those of chicken, pufferfish (Fugu rubripes), or
zebrafish (95). The only substantial syntenic stretches mapped in
these species corresponding to both pairs of human duplications are
restricted to the Hox cluster regions. When the synteny of these
regions (or others) to human chromosomes is extended with further
mapping, the ages of the nearly chromosome-length duplications seen
in humans are likely to be dated to the root of vertebrate
divergence.

The MUMmer-based results demonstrate 
large block duplications that range in size from 
a few genes to segments covering most of a 
chromosome. The extent of segmental duplica- 
tions raises the question of whether an ancient 
whole-genome duplication event is the under- 
lying explanation for the numerous duplicated 
regions (96). The duplications have undergone
many deletions and subsequent rearrangements;
these events make it difficult to distinguish
between a whole-genome duplication and mul- 
tiple smaller events. Further analysis, focused 
especially on comparing the estimated ages of 
all the block duplications, derived partially 
from interspecies genome comparisons, will be 
necessary to determine which of these two hy- 
potheses is more likely. Comparisons of ge- 
nomes of different vertebrates, and even cross- 
phyla genome comparisons, will allow for the 
deconvolution of duplications to eventually reveal the stagewise
history of our genome, and with it a history of the emergence of
many of the key functions that distinguish us from other living
things.

6 A Genome-Wide Examination of
Sequence Variations

Summary. Computational methods were used to identify
single-nucleotide polymorphisms (SNPs) by comparison of the Celera
sequence to other SNP resources. The SNP rate between two
chromosomes was ~1 per 1200 to 1500 bp. SNPs are distributed
nonrandomly throughout the genome. Only a very small proportion of
all SNPs (<1%) potentially impact protein function based on the
functional analysis of SNPs that affect the predicted coding
regions. This results in an estimate that only thousands, not
millions, of genetic variations may contribute to the structural
diversity of human proteins.

Having a complete genome sequence enables researchers to achieve a
dramatic acceleration in the rate of gene discovery, but only
through analysis of sequence variation in DNA can we discover the
genetic basis for variation in health among human beings.
Whole-genome shotgun sequencing is a particularly effective method
for detecting sequence variation in tandem with whole-genome
assembly. In addition, we compared the distribution and attributes
of SNPs ascertained by three other methods: (i) alignment of the
Celera consensus sequence to the PFP assembly, (ii) overlap of
high-quality reads of genomic sequence (referred to as "Kwok";
1,120,195 SNPs) (97), and (iii) reduced representation shotgun
sequencing (referred to as "TSC"; 632,640 SNPs) (98). These data
were consistent in showing an overall nucleotide diversity of
~8 × 10^-4, marked heterogeneity across the genome in SNP density,
and an overwhelming preponderance of noncoding variation that
produces no change in expressed proteins.
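The summary's figure of roughly one SNP per 1200 to 1500 bp between two chromosomes can be converted into a per-base rate and an expected count per window. This is a small illustrative sketch in Python; the function name and the 100-kbp window size used for the printout are choices made here for illustration.

```python
def rate_from_spacing(bp_per_snp: float) -> float:
    """Per-base SNP rate implied by one SNP every `bp_per_snp` bases."""
    return 1.0 / bp_per_snp


WINDOW_BP = 100_000  # illustrative window size

for spacing in (1200, 1500):
    rate = rate_from_spacing(spacing)
    print(
        f"1 SNP per {spacing} bp -> {rate:.2e} per bp, "
        f"about {rate * WINDOW_BP:.0f} SNPs per 100 kbp"
    )
```

The two spacings bracket a per-base rate of roughly 7 × 10^-4 to 8 × 10^-4.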

6.1 SNPs found by aligning the Celera 
consensus to the PFP assembly 

Ideally, methods of SNP discovery make full use of sequence depth
and quality at every site, and quantitatively control the rate of
false-positive and false-negative calls with an explicit sampling
model (99). Comparison of consensus sequences in the absence of
these details necessitated a more ad hoc approach (quality scores
could not readily be obtained for the PFP assembly). First, all
sequence differences between the two consensus sequences were
identified; these were then filtered to reduce the contribution of
sequencing errors and misassembly. As a measure of the effectiveness
of the filtering step, we monitored the ratio of transition to
transversion substitutions, because a 2:1 ratio has been well
documented as typical in mammalian evolution (100) and in human
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Fig. 13. Segmental duplications between chromosomes in the human
genome. The 24 panels show the 1077 duplicated blocks of genes,
containing 10,310 pairs of genes in total. Each line represents a
pair of homologous genes belonging to a block; all blocks contain at
least three genes on each of the chromosomes where they appear. Each
panel shows all the duplications between a single chromosome and
other chromosomes with shared blocks. The chromosome at the center
of each panel is shown as a thick red line for emphasis. Other
chromosomes are displayed from top to bottom within each panel
ordered by chromosome number. The inset (bottom, center right) shows
a close-up of one duplication between chromosomes 18 and 20,
expanded to display the gene names of 12 of the 64 gene pairs shown.
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somes, and whether this heterogeneity is greater than expected by
chance. If SNPs occur by random and independent mutations, then it
would seem that there ought to be a Poisson distribution of numbers
of SNPs in fragments of arbitrary constant size. The observed
dispersion in the distribution of SNPs in 100-kbp fragments was far
greater than predicted from a Poisson distribution (Fig. 14).
However, this simplistic model ignores the different recombination
rates and population histories that exist in different regions of
the genome. Population genetics theory holds that we can account for
this variation with a mathematical formulation called the neutral
coalescent (109). Applying well-tested algorithms for simulating the
neutral coalescent with recombination (110), and using an effective
population size of 10,000 and a per-base recombination rate equal to
the mutation rate (111), we generated a distribution of numbers of
SNPs by this model as well (112). The observed distribution of SNPs
has a much larger variance than either the Poisson model or the
coalescent model, and the difference is highly significant. This
implies that there is significant variability across the genome in
SNP density, an observation that begs an explanation.
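One simple way to see what "greater dispersion than Poisson" means is the variance-to-mean ratio of per-window counts, which is close to 1 for Poisson data. The sketch below, in Python, uses simulated windows only; the rates, window counts, and function names are hypothetical, and it does not reproduce the coalescent simulation described in the text.

```python
import math
import random
from statistics import mean, variance


def poisson_sample(lam: float, rng: random.Random) -> int:
    """Draw one Poisson variate by Knuth's multiplication method."""
    limit = math.exp(-lam)
    k, p = 0, rng.random()
    while p > limit:
        k += 1
        p *= rng.random()
    return k


def dispersion_index(counts) -> float:
    """Variance-to-mean ratio; close to 1 for Poisson-distributed counts."""
    return variance(counts) / mean(counts)


rng = random.Random(0)
# Hypothetical uniform genome: every window has the same expected count.
uniform = [poisson_sample(80, rng) for _ in range(2000)]
# Hypothetical heterogeneous genome: the expected count differs by region.
mixed = [poisson_sample(rng.choice([40, 80, 120]), rng) for _ in range(2000)]

print(f"uniform rate: {dispersion_index(uniform):.2f}")
print(f"mixed rates:  {dispersion_index(mixed):.2f}")
```

A ratio well above 1, as in the mixed case, is the signature of regional variation in the underlying rate.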

Several attributes of the DNA sequence may affect the local density
of SNPs, including the rate at which DNA polymerase makes errors and
the efficacy of mismatch repair. One key factor that is likely to be
associated with SNP density is the G+C content, in part because
methylated cytosines in CpG dinucleotides tend to undergo
deamination to form thymine, accounting for a nearly 10-fold
increase in the mutation rate of CpGs over other dinucleotides. We
tallied the GC content and nucleotide diversities in 100-kbp windows
across the entire genome and found that the correlation between them
was positive (r = 0.21) and highly significant (P < 0.0001), but G+C
content accounted for only a small part of the variation.
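The statement that G+C content explains only a small part of the variation follows from squaring the correlation coefficient. A short Python sketch, in which `pearson_r` and its toy inputs are illustrative and not the authors' data:

```python
import math


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def variance_explained(r: float) -> float:
    """Share of variance accounted for by a linear fit with correlation r."""
    return r * r


# r = 0.21 is the value reported in the text.
print(f"variance explained at r = 0.21: {variance_explained(0.21):.1%}")
```

With r = 0.21, a linear relationship with G+C content accounts for about 4% of the window-to-window variance in nucleotide diversity.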

6.5 SNPs by genomic class

To test homogeneity of SNP densities across functional classes, we
partitioned sites into intergenic (defined as >5 kbp from any
predicted transcription unit), 5'-UTR, exonic (missense and silent),
intronic, and 3'-UTR for 10,239 known genes, derived from the NCBI
RefSeq database and all human genes predicted from the Celera Otto
annotation. In coding regions, SNPs were categorized as either
silent, for those that do not change amino acid sequence, or
missense, for those that change the protein product. The ratio of
missense to silent coding SNPs in Celera-PFP, TSC, and Kwok sets
(1.12, 0.91, and 0.78, respectively) shows a markedly reduced
frequency of missense variants compared with the neutral
expectation, consistent with the elimination by natural selection of
a fraction of the deleterious amino acid changes (112). These ratios
are comparable to the missense-to-silent ratios of 0.88 and 1.17
found by Cargill et al. (101) and by Halushka et al. (102). Similar
results were observed in SNPs derived from Celera shotgun sequences
(46).

It is striking how small is the fraction of SNPs that lead to
potentially dysfunctional alterations in proteins. In the 10,239
RefSeq genes, missense SNPs were only about 0.12, 0.14, and 0.17% of
the total SNP counts in Celera-PFP, TSC, and Kwok SNPs,
respectively. Nonconservative protein changes constitute an even
smaller fraction of missense SNPs (47, 41, and 40% in Celera-PFP,
Kwok, and TSC). Intergenic regions have been virtually unstudied
(113), and we note that 75% of the SNPs we identified were
intergenic (Table 17). The SNP rate was highest in introns and
lowest in exons. The SNP rate was lower in intergenic regions than
in introns, providing one of the first discriminators between these
two classes of DNA. These SNP rates were confirmed in the Celera
SNPs, which also exhibited a lower rate in exons than in introns,
and in extragenic regions than in introns (45). Many of these
intergenic SNPs will provide valuable information in the form of
markers for linkage and association studies, and some fraction is
likely to have a regulatory function as well.

Number of SNPs / 100 kb

Fig. 14. SNP density in each 100-kbp interval as determined with
Celera-PFP SNPs. The color codes are as follows: black, Celera-PFP
SNP density; blue, coalescent model; and red, Poisson distribution.
The figure shows that the distribution of SNPs along the genome is
nonrandom and is not entirely accounted for by a coalescent model of
regional history.
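The claim that only thousands of SNPs alter proteins can be illustrated with the set sizes and percentages quoted in this section. The Python sketch below is rough arithmetic only: it multiplies the rounded percentages from the text by the Kwok and TSC totals, and the dictionary layout and function name are illustrative assumptions.

```python
# Totals and percentages as quoted in the text (rounded values).
SNP_SETS = {
    "Kwok": {"total": 1_120_195, "missense_pct": 0.17, "nonconservative_pct": 41},
    "TSC": {"total": 632_640, "missense_pct": 0.14, "nonconservative_pct": 40},
}


def estimate_counts(total: int, missense_pct: float, nonconservative_pct: float):
    """Return approximate (missense, nonconservative missense) SNP counts."""
    missense = total * missense_pct / 100
    nonconservative = missense * nonconservative_pct / 100
    return missense, nonconservative


for name, s in SNP_SETS.items():
    missense, nonconservative = estimate_counts(
        s["total"], s["missense_pct"], s["nonconservative_pct"]
    )
    print(f"{name}: ~{missense:.0f} missense, ~{nonconservative:.0f} nonconservative")
```

Because the inputs are rounded, the outputs are order-of-magnitude estimates: on the order of a thousand or two missense SNPs per set, and a few hundred nonconservative ones.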

7 An Overview of the Predicted
Protein-Coding Genes in the Human
Genome

Summary. This section provides an initial computational analysis of
the predicted protein set with the aim of cataloging prominent
differences and similarities when the human genome is compared with
other fully sequenced eukaryotic genomes. Over 40% of the predicted
protein set in humans cannot be ascribed a molecular function by
methods that assign proteins to known families. A protein
domain-based analysis provides a detailed catalog of the prominent
differences in the human genome when compared with the fly and worm
genomes. Prominent among these are domain expansions in proteins
involved in developmental regulation and in cellular processes such
as neuronal function, hemostasis, acquired immune response, and
cytoskeletal complexity. The final enumeration of protein families
and details of protein structure will rely on additional
experimental work and comprehensive manual curation.

A preliminary analysis of the predicted human protein-coding genes
was conducted. Two methods were used to analyze and classify the
molecular functions of 26,588 predicted proteins that represent
26,383 gene predictions with at least two lines of evidence as
described above. The first method was based on an analysis at the
level of protein families, with both the publicly available Pfam
database (114, 115) and Celera's Panther Classification (CPC)
(Fig. 15) (116). The second method was based on an analysis at the
level of protein domains, with both the Pfam and SMART databases
(115, 117).

The results presented here are preliminary and are subject to
several limitations.
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7.2 Evolutionary conservation of core
processes

Because of the various "model organism" genome-sequencing projects
that have already been completed, reasonable comparative information
is available for beginning the analysis of the evolution of the
human genome. The genomes of S. cerevisiae ("bakers' yeast") (118)
and two diverse invertebrates, C. elegans (a nematode worm) (119)
and D. melanogaster (fly) (25), as well as the first plant genome,
A. thaliana, recently completed (92), provide a diverse background
for genome comparisons.

We enumerated the "strict orthologs" conserved between human and
fly, and between human and worm (Fig. 16) to address the question,
What are the core functions that appear to be common across the
animals? The concept of orthology is important because if two genes
are orthologs, they can be traced by descent to the common ancestor
of the two organisms (an "evolutionarily conserved protein set"),
and therefore are likely to perform similar conserved functions in
the different organisms. It is critical in this analysis to separate
orthologs (a gene that appears in two organisms by descent from a
common ancestor) from paralogs (a gene that appears in more than one
copy in a given organism by a duplication event) because paralogs
may subsequently diverge in function. Following the yeast-worm
ortholog comparison in
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(120), we identified two different cases for each pairwise
comparison (human-fly and human-worm). The first case was a pair of
genes, one from each organism, for which there was no other close
homolog in either organism. These are straightforwardly identified
as orthologous, because there are no additional members of the
families that complicate separating orthologs from paralogs. The
second case is a family of genes with more than one member in either
or both of the organisms being compared. Chervitz et al. (120) dealt
with this case by analyzing a phylogenetic tree that described the
relationships between all of the sequences in both organisms, and
then looked for pairs of genes that were nearest neighbors in the
tree. If the nearest-neighbor pairs were from different organisms,
those genes were presumed to be orthologs. We note that these
nearest neighbors can often be confidently identified from pairwise
sequence comparison without having to examine a phylogenetic tree
(see legend to Fig. 16). If the nearest neighbors are not from
different organisms, there has been a paralogous expansion in one or
both organisms after the speciation event (and/or a gene loss by one
organism). When this one-to-one correspondence is lost, defining an
ortholog becomes ambiguous. For our initial computational overview
of the predicted human protein set, we could not answer this
question for every predicted protein. Therefore, we consider only
"strict orthologs," i.e., the proteins with unambiguous one-to-one
relationships (Fig. 16). By these criteria, there are 2758 strict
human-fly orthologs and 2031 human-worm (1523 in common between
these sets). We define the evolutionarily conserved set as those
1523 human proteins that have strict orthologs in both
D. melanogaster and C. elegans.
The distribution of the functions of the conserved protein set is
shown in Fig. 16. Comparison with Fig. 15 shows that, not
surprisingly, the set of conserved proteins is not distributed among
molecular functions in the same way as the whole human protein set.
Compared with the whole human set (Fig. 15), there are several
categories that are overrepresented in the conserved set by a factor
of ~2 or more. The first category is nucleic acid enzymes, primarily
the transcriptional machinery (notably DNA/RNA methyltransferases,
DNA/RNA polymerases, helicases, DNA ligases, DNA- and RNA-processing
factors, nucleases, and ribosomal proteins). The basic
transcriptional and translational machinery is well known to have
been conserved over evolution, from bacteria through to the most
complex eukaryotes. Many ribonucleoproteins involved in RNA splicing
also appear to be conserved among the animals. Other enzyme types
are also overrepresented (transferases, oxidoreductases, ligases,
lyases, and isomerases). Many of these en-



Fig. 16. Functions of putative orthologs across vertebrate and
invertebrate genomes. Each slice lists the number and percentages
(in parentheses) of "strict orthologs" between the human, fly, and
worm genomes involved in a given category of molecular function.
"Strict orthologs" are defined here as bi-directional BLAST best
hits (180) such that each orthologous pair (i) has a BLASTP P-value
of ≤10^-10 (120), and (ii) has a more significant BLASTP score than
any paralogs in either organism, i.e., there has likely been no
duplication subsequent to speciation that might make the orthology
ambiguous. This measure is quite strict and is a lower bound on the
number of orthologs. By these criteria, there are 2758 strict
human-fly orthologs, and 2031 human-worm orthologs (1523 in common
between these sets).



cytoskeletal structural protein (20, 1.2%)
chaperone (16, 0.9%)
cell adhesion (11, 0.6%)
miscellaneous (72, 4.2%)
viral protein (4, 0.2%)
transfer/carrier protein (11, 0.6%)
transcription factor (81, 4.7%)
nucleic acid enzyme (221, 12.9%)
extracellular matrix (12, 0.7%)
ion channel (7, 0.4%)
motor (13, 0.8%)
structural protein of muscle (8, 0.5%)
protooncogene (23, 1.3%)
intracellular transporter (51, 3.0%)
transporter (44, 2.6%)
receptor (23, 1.3%)
kinase (69, 4.0%)
select regulatory molecule (88, 5.1%)
transferase (70, 4.1%)
synthase and synthetase (64, 3.7%)
oxidoreductase (64, 3.7%)
lyase (12, 0.7%)
ligase (9, 0.5%)
molecular function unknown (613, 35.8%)
hydrolase (80, 4.7%)
isomerase (21, 1.2%)
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Table 18. Domain-based comparative analysis of proteins in
H. sapiens (H), D. melanogaster (F), C. elegans (W), S. cerevisiae
(Y), and A. thaliana (A). The predicted protein set of each of the
above eukaryotic organisms was analyzed with Pfam version 5.5 using
E value cutoffs of 0.001. The number of proteins containing the
specified Pfam domains as well as the total number of domains (in
parentheses) are shown in each column. Domains were categorized into
cellular processes for presentation. Some domains (i.e., SH2) are
listed in more than one cellular process. Results of the Pfam
analysis may differ from results obtained based on human curation of
protein families, owing to the limitations of large-scale automatic
classifications. Representative examples of domains with reduced
counts owing to the stringent E value cutoff used for this analysis
are marked with a double asterisk (**). Examples include short
divergent and predominantly alpha-helical domains, and certain
classes of cysteine-rich zinc finger proteins.
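The cell format described in the caption, a protein count followed by the total domain count in parentheses, can be produced from a list of domain hits as sketched below in Python. The hit tuples, identifiers, and function names are hypothetical; treating the cutoff as inclusive and printing a single number when the two counts agree are assumptions made here, not details stated in the caption.

```python
from collections import defaultdict

E_VALUE_CUTOFF = 0.001  # cutoff quoted in the caption


def count_domains(hits, cutoff=E_VALUE_CUTOFF):
    """Return {domain: (proteins containing it, total domain copies)}.

    `hits` is an iterable of (protein_id, domain_name, e_value) tuples,
    one tuple per domain occurrence.
    """
    proteins = defaultdict(set)
    copies = defaultdict(int)
    for protein_id, domain, e_value in hits:
        if e_value <= cutoff:  # assumption: cutoff is inclusive
            proteins[domain].add(protein_id)
            copies[domain] += 1
    return {domain: (len(proteins[domain]), copies[domain]) for domain in copies}


def format_cell(n_proteins: int, n_domains: int) -> str:
    """Render a cell as 'proteins (domains)', or one number if they agree."""
    if n_proteins == n_domains:
        return str(n_proteins)
    return f"{n_proteins} ({n_domains})"


# Hypothetical hits: protein P1 carries three Cadherin repeats.
hits = [
    ("P1", "Cadherin", 1e-20),
    ("P1", "Cadherin", 1e-15),
    ("P1", "Cadherin", 1e-8),
    ("P2", "Cadherin", 1e-5),
    ("P2", "Cadherin", 0.5),  # rejected by the cutoff
    ("P3", "Wnt", 1e-30),
]
table = {domain: format_cell(*pair) for domain, pair in count_domains(hits).items()}
print(table)
```

A stricter cutoff lowers both numbers, which is the effect the caption flags with a double asterisk for short or divergent domains.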



Accession  Domain name        Domain description                              H  F  W  Y  A
number

PF02039    Adrenomedullin     Adrenomedullin
PF00212    ANP                Atrial natriuretic peptide
PF00028    Cadherin           Cadherin domain
PF00214    Calc_CGRP_IAPP     Calcitonin/CGRP/IAPP family
PF01110    CNTF               Ciliary neurotrophic factor
PF01093    Clusterin          Clusterin
PF00029    Connexin           Connexin
PF00976    ACTH_domain        Corticotropin ACTH domain
PF00473    CRF                Corticotropin-releasing factor family
PF00007    Cys_knot           Cystine-knot domain
PF00778    DIX                Dix domain
PF00322    Endothelin         Endothelin family
PF00812    Ephrin             Ephrin
PF01404    EPh_lbd            Ephrin receptor ligand binding domain
PF00167    FGF                Fibroblast growth factor
PF01534    Frizzled           Frizzled/Smoothened family membrane region
PF00236    Hormone6           Glycoprotein hormones
PF01153    Glypican           Glypican
PF01271    Granin             Granin (chromogranin or secretogranin)
PF02058    Guanylin           Guanylin precursor
PF00049    Insulin            Insulin/IGF/Relaxin family
PF00219    IGFBP              Insulin-like growth factor binding proteins
PF02024    Leptin             Leptin
PF00193    Xlink              LINK (hyaluron binding)
PF00243    NGF                Nerve growth factor family
PF02158    Neuregulin         Neuregulin family
PF00184    Hormone5           Neurohypophysial hormones
           NMU                Neuromedin U
PF00066    Notch              Notch (DSL) domain
PF00865    Osteopontin        Osteopontin
PF00159    Hormone3           Pancreatic hormone peptides
PF01279    Parathyroid        Parathyroid hormone family
PF00123    Hormone2           Peptide hormone
PF00341    PDGF               Platelet-derived growth factor (PDGF)
PF01403    Sema               Sema domain
PF01033    Somatomedin_B      Somatomedin B domain
PF00103    Hormone            Somatotropin
PF02208    Sorb               Sorbin homologous domain
PF02404    SCF                Stem cell factor
PF01034    Syndecan           Syndecan domain
PF00020    TNFR_c6            TNFR/NGFR cysteine-rich region
PF00019    TGF-β              Transforming growth factor β-like domain
PF01099    Uteroglobin        Uteroglobin family
PF01160    Opioids_neuropep   Vertebrate endogenous opioids neuropeptide
PF00110    Wnt                Wnt family of developmental signaling proteins

Hemostasis

PF01821    ANATO              Anaphylotoxin-like domain
PF00386    C1q                C1q domain
PF00200    Disintegrin        Disintegrin
PF00754    F5_F8_type_C       F5/8 type C domain
PF01410    COLFI              Fibrillar collagen C-terminal domain
PF00039    Fn1                Fibronectin type I domain
PF00040    Fn2                Fibronectin type II domain
PF00051    Kringle            Kringle domain
PF01823    MACPF              MAC/Perforin domain
PF00354    Pentaxin           Pentaxin family
PF00277    SAA_proteins       Serum amyloid A protein
PF00084    Sushi              Sushi domain (SCR repeat)
PF02210    TSPN               Thrombospondin N-terminal-like domains
PF01108    Tissue_fac         Tissue factor
PF00868    Transglutamin_N    Transglutaminase family
PF00927    Transglutamin_C    Transglutaminase family



regulators 

1 
2 

100(550) 
3 
1 
3 

14(16) 
.1 
2 

10(11) 
5 

7(8) 
12 
23 
9 
1 

14 
3 
1 
7 

10 
1 

13(23) 
. 3 
4 
1 
1 

3(5) 
1 
3 

5(9) 
5 

27(29) . 
5(8) 
1 
2 
2 
3 

17(31) 
27(28) 
3 
3 
18 

6(14) 
24 
18 
15(20) 
10 

5(18) V 
:11(16) 
15(24) 
6 
9 
4 

53 (191) 
14 
1 
6 
8 



0 
0 

14(157) 
0 

. 0 
: 0 
:. 0 
0 

■ '1 

2 

2 

0 

2 

2 

1 

7 

0 

2 

0 

0 

4 

0 

0 

0 

0 

0 

0 

0 

2(4) 
0 
0 
0 
0 

1 

::8(10) 
3 
0 
0 
0 
1 
1 
6 
0 
0 

7(10) 

0 
0 

5(6) 

. 0 
0. 
2 
0 
0 
0 

11(42) 
1 
0 
1 
1 



0 
0 

16(66) 
0 
0 
0 
0 
0 
0 
0 
4 
0 
4 
1 
1 
3 
0 
1 
0 
0 
0 
0 
0 
1 
0 
6 

b 

0 

2(6) 
0 
0 
0 
0 
0 

3(4) 
0 
0 
0 
0 

1 

0 
4 
0 
0 
5 

0 
0 
3 
2 

0 ' 
• 0 
0 
2 
0 
0 
0 

8(45) 
0 
0 
0 
0 



0 
0 
0 
0 

. 0 
0 

- 0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 



0 
0 
0 
0 
0 
0 
0 
0 
0 

6 

0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
. 0 

6 

0 
0 
0 
0 
0 
0 
0 

0 
0 
0 
0 
0 
0 
0 
0 
• 0 
0 
0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 

0 
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Accession  Domain name  Domain description                        H          F        W        Y        A
number

PF00620    RhoGAP       RhoGAP domain                             59         19       20       9        8
PF00621    RhoGEF       RhoGEF domain                             46         23 (24)  18 (19)  3        0
PF00536    SAM          SAM domain (Sterile alpha motif)          29 (31)    15       8        3        6
PF01369    Sec7         Sec7 domain                               13         5        5        5        9
PF00017    SH2          Src homology 2 (SH2) domain               87 (95)    33 (39)  44 (48)  1        3
PF00018    SH3          Src homology 3 (SH3) domain               143 (182)  55 (75)  46 (61)  23 (27)  4
PF01017    STAT         STAT protein                              7          1        1 (2)    0        0
PF00790    VHS          VHS domain                                4          2        4        4        8
PF00568    WH1          WH1 domain                                7          2        2 (3)             0

Domains involved in apoptosis

PF00452    Bcl-2                                                  9          2        1        0        0
PF02180    BH4          Bcl-2 homology region 4                   3          0        1        0        0
PF00519    CARD         Caspase recruitment domain                16         0        2        0        0
PF00531    Death        Death domain                              16         5        7        0        0
PF01335    DED          Death effector domain                     4 (5)      0        0        0        0
PF02179    BAG          Domain present in Hsp70 regulators        5 (8)      3        2        1        5
PF00656    ICE_p20      ICE-like protease (caspase) p20 domain    11                  3        0        0
PF00653    BIR          Inhibitor of Apoptosis domain             8 (14)     5 (9)    2 (3)    1 (2)    0

Cytoskeletal

PF00022    Actin            Actin                                    61 (64)   15 (16)   12       9 (11)  24
PF00191    Annexin          Annexin                                  16 (55)   4 (16)    4 (11)   0       6 (16)
PF00402    Calponin         Calponin family                          13 (22)   3         7 (19)   0       0
PF00373    Band_41          FERM domain (Band 4.1 family)            29 (30)   17 (19)   11 (14)  0       0
PF00880    Nebulin_repeat   Nebulin repeat                           4 (148)   1 (2)     1        0       0
PF00681    Plectin_repeat   Plectin repeat                           2 (11)    0         0        0       0
PF00435    Spectrin         Spectrin repeat                          31 (195)  13 (171)  10 (93)  0       0
PF00418    Tubulin-binding  Tau and MAP proteins, tubulin-binding    4 (12)    1 (4)     2 (8)    0       0
PF00992    Troponin         Troponin                                 4         6         8        0       0
PF02209    VHP              Villin headpiece domain                  5         2         2        0       5
PF01044    Vinculin         Vinculin family                          4         2         1        0       0

ECM adhesion

PF01391    Collagen         Collagen triple helix repeat (20 copies)
PF01413    C4               C-terminal tandem repeated domain in type 4 procollagen
PF00431    CUB              CUB domain
PF00008    EGF              EGF-like domain
PF00147    Fibrinogen_C     Fibrinogen beta and gamma chains, C-terminal globular domain
PF00041    Fn3              Fibronectin type III domain
PF00757    Furin-like       Furin-like cysteine rich region
PF00357    Integrin_A       Integrin alpha cytoplasmic region
PF00362    Integrin_B       Integrins, beta chain
PF00052    Laminin_B        Laminin B (Domain IV)
PF00053    Laminin_EGF      Laminin EGF-like (Domains III and V)
PF00054    Laminin_G        Laminin G domain
PF00055    Laminin_Nterm    Laminin N-terminal (Domain VI)
PF00059    Lectin_c         Lectin C-type domain
PF01463    LRRCT            Leucine rich repeat C-terminal domain
PF01462    LRRNT            Leucine rich repeat N-terminal domain
PF00057    Ldl_recept_a     Low-density lipoprotein receptor domain class A
PF00058    Ldl_recept_b     Low-density lipoprotein receptor repeat class B
PF00530    SRCR             Scavenger receptor cysteine-rich domain
PF00084    Sushi            Sushi domain (SCR repeat)
PF00090    Tsp_1            Thrombospondin type 1 domain
PF00092    Vwa              von Willebrand factor type A domain
PF00093    Vwc              von Willebrand factor type C domain
PF00094    Vwd              von Willebrand factor type D domain

Protein interaction domains

PF00244    14-3-3           14-3-3 proteins
PF00023    Ank              Ank repeat
PF00514    Armadillo_seg    Armadillo/beta-catenin-like repeats
PF00168    C2               C2 domain
PF00027    cNMP_binding     Cyclic nucleotide-binding domain
PF01556    DnaJ_C           DnaJ C terminal region
PF00226    DnaJ             DnaJ domain
PF00036    Efhand**         EF hand
PF00611    FCH              Fes/CIP4 homology domain
PF01846    FF               FF domain
PF00498    FHA              FHA domain



' 65 (279) " 


10(46) : 


174(384) 


■ : 0 


0 


.6(11) 


2(4) 


3(6) 


0 


0 


47(69) 


9(47) 


43(67) 


0 


0 


108 (420) 


45(186) 


54(157) 


0 


1 


26 


10(11) 


6 


0 


0 



106 (545) 
5 
3 
8 

8(12) 
24(126) 
30(57) 
10 
47 (76 1 
69 (81 1 
40(44 
35 (127 1 
15(96 
11(461 
53(191 
41 (66 
34 (58 1 
19(28 
15(35) 

20 

145(404} 
22(56) 
73(101) 
26(31) 
12 
44 

83(151) 
9 

4(11) 
13 



42 (168) 
2 
1 
2 

4(7) 
9(62) 
18(42) 
6 

23(24) 
23(30) 
7(13) 
33 (152) 
9(56) 
4(8) 
11(42) 
11(23) 

0 . 
6(11) 
3(7) 

:,3' 
72 (269) 
11(38) 
32 (44) 
21 (33) 
9 
34 

64(117) 
3 

4(10) 

. 15 



34 (156) 
1 
2 
2 

6(10) 
11(65) 
14(26) 
4 

91 (132) 
7(9) 
3(6) 
27(113) 
7(22) 
1(2) 
8(45) 
18(47) 
17(19) 
2(5) 

. : 9 

■ -3 . 
75(223) 
3(11) 
24(35) 
15(20) 
5 
33 
41 (86) 

3(16) 
7 



- :-- 0. 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

. 2 " 
12(20) 
2(10) 
6(9) 
2(3) 
3 
20 
4(11) 
4 

2(5) 
13(14) 



0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 
0 

1 

0 

0 : 

15 

66 (111) 
25(67) 
66(90) 
22 
19 
93 

120(328) 
0 

4(8) 
17 
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Table 18 (Continued) 


Accession 


Domain name 


number 


PF02135 


2f-TAZ 


PF01285 


TEA 


PF02176 


. 2f-TRAF . . 


PF00352 


TBP 


PF00567 


TUDOR 


PF00642 . 


. 2f-CCCH : 


'PF00096 


Zf-C2H2** 


PF00097 


2f-C3HC4 


PF00098 


' 2f-CCHC 



• Domain description 

TAZ finger 
TEA domain 
TRAF-type zinc finger 
Transcription factor TFIID (or TATA-binding 
" \protein;TBP) 
TUDOR domain 
= Zinc finger. C-x8-C-x5-C-x3-H type (and similar) • 
Zinc finger/C2H2 type 

Zinc finger, C3HC4 type (RING finger) 
Zinc knuckle 



H 

2(3} 
4 

-6(9) 

. ;.2(4) 

9(24) 
^7(22) 
564(4500) : 
135 (137) 
9(17) 



1(2) 

: - m 
■ y^m^ 

V v v6(8); 
234(771) 
57 
.6(10) 



W 
6(7) 

^ i v2(4) : : 

. : 4(5) 

";22{42) 
^68(155);.- 
88(89) 
17(33) : 



0 

1 • 

- . .0 

' 0 
5- 3(5) -/ 
34(56)i 
18 

: 7(13) 



10(15) 
0 

^V-r2(4): 
■ ;.. 2. 

■31 (46) 
;C21(24) 
298(304) 
68(91) 



(Tables 18 and 19). They include secreted hormones and growth factors,
receptors, intracellular signaling molecules, and transcription factors.

Developmental signaling molecules that are enriched in the human genome
include growth factors such as wnt, transforming growth factor-β (TGF-β),
fibroblast growth factor (FGF), nerve growth factor, platelet-derived
growth factor (PDGF), and ephrins. These growth factors affect tissue
differentiation and a wide range of cellular processes involving
actin-cytoskeletal and nuclear regulation. The corresponding receptors of
these developmental ligands are also expanded in humans. For example, our
analysis suggests at least 8 human ephrin genes (2 in the fly, 4 in the
worm) and 12 ephrin receptors (2 in the fly, 1 in the worm). In the wnt
signaling pathway, we find 18 wnt family genes (6 in the fly, 5 in the
worm) and 12 frizzled receptors (6 in the fly, 5 in the worm). The Groucho
family of transcriptional corepressors downstream in the wnt pathway is
even more markedly expanded, with 13 predicted members in humans (2 in the
fly, 1 in the worm).
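
The fold expansions implied by the counts quoted above can be tabulated directly. The following Python sketch uses only the gene counts given in the text; the dictionary layout and the helper name `fold_expansion` are illustrative assumptions and are not part of the original analysis.

```python
# Gene counts quoted in the text, as (human, fly, worm).
counts = {
    "ephrin": (8, 2, 4),
    "ephrin receptor": (12, 2, 1),
    "wnt": (18, 6, 5),
    "frizzled receptor": (12, 6, 5),
    "Groucho": (13, 2, 1),
}


def fold_expansion(human, other):
    """Return the human gene count divided by the count in another genome."""
    return human / other


for family, (human, fly, worm) in counts.items():
    print(
        f"{family:18s} vs fly: {fold_expansion(human, fly):4.1f}x   "
        f"vs worm: {fold_expansion(human, worm):4.1f}x"
    )
```

These ratios compare raw gene counts only; they say nothing about how the expansions arose.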

Extracellular adhesion molecules involved in signaling are expanded in the
human genome (Tables 18 and 19). The interactions of several of these
adhesion domains with extracellular matrix proteoglycans play a critical
role in host defense, morphogenesis, and tissue repair (131). Consistent
with the well-defined role of heparan sulfate proteoglycans in modulating
these interactions (132), we observe an expansion of the heparan sulfate
sulfotransferases in the human genome relative to worm and fly. These
sulfotransferases modulate tissue differentiation (133). A similar
expansion in humans is noted in structural proteins that constitute the
actin-cytoskeletal architecture. Compared with the fly and worm, we observe
an explosive expansion of the nebulin (35 domains per protein on average),
aggrecan (12 domains per protein on average), and plectin (5 domains per
protein on average) repeats in humans. These repeats are present in
proteins involved in modulating the actin-cytoskeleton with predominant
expression in neuronal, muscle, and vascular tissues.



Comparison across the five sequenced eukaryotic organisms revealed several
expanded protein families and domains involved in cytoplasmic signal
transduction (Table 18). In particular, signal transduction pathways
playing roles in developmental regulation and acquired immunity were
substantially enriched. There is a factor of 2 or greater expansion in
humans in the Ras superfamily GTPases and the GTPase activator and GTP
exchange factors associated with them. Although there are about the same
number of tyrosine kinases in the human and C. elegans genomes, in humans
there is an increase in the SH2, PTB, and ITAM domains involved in
phosphotyrosine signal transduction. Further, there is a twofold expansion
of phosphodiesterases in the human genome compared with either the worm or
fly genomes.

The downstream effectors of the intracellular signaling molecules include
the transcription factors that transduce developmental fates. Significant
expansions are noted in the ligand-binding nuclear hormone receptor class
of transcription factors compared with the fly genome, although not to the
extent observed in the worm (Tables 18 and 19). Perhaps the most striking
expansion in humans is in the C2H2 zinc finger transcription factors. Pfam
detects a total of 4500 C2H2 zinc finger domains in 564 human proteins,
compared with 771 in 234 fly proteins. This means that there has been a
dramatic expansion not only in the number of C2H2 transcription factors,
but also in the number of these DNA-binding motifs per transcription factor
(8 on average in humans, 3.3 on average in the fly, and 2.3 on average in
the worm). Furthermore, many of these transcription factors contain either
the KRAB or SCAN domains, which are not found in the fly or worm genomes.
These domains are involved in the oligomerization of transcription factors
and increase the combinatorial partnering of these factors. In general,
most of the transcription factor domains are shared between the three
animal genomes, but the reassortment of these domains results in
organism-specific transcription factor families. The domain combinations
found in the human, fly, and worm include the BTB with C2H2 in the fly and
humans, and homeodomains alone or in combination with Pou and LIM domains
in all of the animal genomes. In plants, however, a different set of
transcription factors are expanded, namely, the myb family, and a unique
set that includes VP1 and AP2 domain-containing proteins (134). The yeast
genome has a paucity of transcription factors compared with the
multicellular eukaryotes, and its repertoire is limited to the expansion of
the yeast-specific C6 transcription factor family involved in metabolic
regulation.
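
The per-protein averages quoted for the C2H2 zinc fingers follow from the protein and domain totals. The short Python sketch below recomputes them; the human and fly totals are those stated in the text, the worm totals (68 proteins, 155 domains) are read from the zf-C2H2 row of Table 18, and the helper name `domains_per_protein` is a hypothetical label used only for illustration.

```python
# (number of proteins, total number of C2H2 domains) per genome.
c2h2 = {
    "human": (564, 4500),
    "fly": (234, 771),
    "worm": (68, 155),  # read from the zf-C2H2 row of Table 18
}


def domains_per_protein(proteins, domains):
    """Average number of C2H2 domains carried by each protein."""
    return domains / proteins


averages = {
    organism: round(domains_per_protein(proteins, domains), 1)
    for organism, (proteins, domains) in c2h2.items()
}
print(averages)
```

Rounded to one decimal place, the results agree with the averages of 8, 3.3, and 2.3 given in the text.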
While we have illustrated expansions in a subset of signal transduction
molecules in the human genome compared with the other eukaryotic genomes,
it should be noted that most of the protein domains are highly conserved.
An interesting observation is that worms and humans have approximately the
same number of both tyrosine kinases and serine/threonine kinases (Table
19). It is important to note, however, that these are merely counts of the
catalytic domain; the proteins that contain these domains also display a
wide repertoire of interaction domains with significant combinatorial
diversity.
Hemostasis. Hemostasis is regulated primarily by plasma proteases of the
coagulation pathway and by the interactions that occur between the vascular
endothelium and platelets. Consistent with known anatomical and
physiological differences between vertebrates and invertebrates,
extracellular adhesion domains that constitute proteins integral to
hemostasis are expanded in the human relative to the fly and worm (Tables
18 and 19). We note the evolution of domains such as FIMAC, FN1, FN2, and
C1q that mediate surface interactions between hematopoietic cells and the
vascular matrix. In addition, there has been extensive recruitment of
more-ancient animal-specific domains such as VWA, VWC, VWD, kringle, and
FN3 into multidomain proteins that are involved in hemostatic regulation.
Although we do not find a large expansion in the total number of serine
proteases, this enzymatic domain has been specifically recruited into
several of these multidomain proteins for proteolytic regulation in the
vascular compartment. These are represented in plasma proteins that belong
to the kinin and complement pathways. There is a

man expansions has occurred in certain families [...]

We identified 28 different ribosomal subunits that each have at least 10
copies in the genome; on average [...] there is about an 8- to 10-fold
expansion in the number of genes relative to either the worm or fly.
Retrotransposed pseudogenes [...]

[...]thesis; for example, L13a and the related [...] shown to induce
apoptosis [...] expansions [...]logs that have presumably arisen from
retro-
Table 19 (Continued)

Panther family/subfamily*                   H      F      W

MHC class I                                 22     0      0
MHC class II                                20     0      0
Other immunoglobulin†                       114    0      0
Toll receptor-related                       10     6      0

Developmental and homeostatic regulators

Signaling molecules†
  Calcitonin                                3      0      0
  Ephrin                                    8      2      4
  FGF                                       24     1      1
  Glucagon                                  4      0      ?
  Glycoprotein hormone beta chain           2      0      ?
  Insulin                                   1      0      ?
  Insulin-like hormone                      3      0      ?
  Nerve growth factor                       3      0      ?
  Neuregulin/heregulin                      6      0      ?
  Neuropeptide Y                            4      0      ?
  PDGF                                      1      1      ?
  Relaxin                                   3      0      ?
  Stanniocalcin                             2      0      ?
  Thymopoietin                              2      0      ?
  Thymosin beta                             4      2      ?
  TGF-β                                     29     6      ?
  VEGF                                      4      0      ?
  Wnt                                       18     6      ?
Receptors†
  Ephrin receptor                           12     2      1
  FGF receptor                              4      4      0
  Frizzled receptor                         12     6      5
  Parathyroid hormone receptor              2      0      0
  VEGF receptor                             5      0      0
  BDNF/NT-3 nerve growth factor receptor    4      0      0

Kinases and phosphatases

  Dual-specificity protein phosphatase
  S/T and dual-specificity protein kinase†
  S/T protein phosphatase
  Y protein kinase†
  Y protein phosphatase
  ARF family
  Cyclic nucleotide phosphodiesterase
  G protein-coupled receptors†‡
  G-protein alpha
  G-protein beta
  G-protein gamma
  Ras superfamily
  G-protein modulators†
  ARF GTPase-activating
  Neurofibromin
  Ras GTPase-activating
  Tuberin
  Vav proto-oncogene family

? = entry not assignable from the source. The worm column for the rows
Glucagon through Wnt appears in the source as ten 0s followed by 1, 0, 4, 0
(14 values for 15 rows). No counts survive in the source for the rows from
Dual-specificity protein phosphatase onward; two unassigned groups of four
0s follow the table.

diversity in an organism's protein complement. We have identified 269 genes
for ribonucleoproteins. This is 2.5 times the number of ribonucleoprotein
genes in the worm, two times that of the fly, and about the same as the 265
identified in the Arabidopsis genome. Whether the diversity of
ribonucleoprotein genes in humans contributes to gene regulation at either
the splicing or translational level is unknown.

Posttranslational modifications. In this set of processes, the most
prominent expansion is the transglutaminases, calcium-dependent enzymes
that catalyze the cross-linking of proteins in cellular processes such as
hemostasis and apoptosis (147). The vitamin K-dependent gamma carboxylase
gene product acts on the GLA domain (missing in the fly and worm) found in
coagulation factors, osteocalcin, and matrix GLA protein (148).
Tyrosylprotein sulfotransferases participate in the posttranslational
modification of proteins involved in inflammation and hemostasis, including
coagulation factors and chemokine receptors (149). Although there is no
significant numerical increase in the counts for domains involved in
nuclear protein modification, there are a number of domain arrangements in
the predicted human proteins that are not found in the other currently
sequenced genomes. These include the tandem association of two histone
deacetylase domains in HD6 with a ubiquitin finger domain, a feature
lacking in the fly genome. An additional example is the co-occurrence of
the important nuclear regulatory enzyme PARP (poly-ADP ribosyl transferase)
domain fused to protein-interaction domains, BRCT and VWA, in humans.

Concluding remarks. There are several possible explanations for the
differences in phenotypic complexity observed in humans when compared to
the fly and worm. Some of these relate to the prominent differences in the
immune system, hemostasis, neuronal, vascular, and cytoskeletal complexity.
The finding that the human genome contains fewer genes than previously
predicted might be compensated for by combinatorial diversity generated at
the levels of protein architecture, transcriptional and translational
control, posttranslational modification of proteins, or posttranscriptional
regulation. Extensive domain shuffling to increase or alter combinatorial
diversity can provide an exponential

quence well into centromeric regions and allowed high-quality resolution of
complex repeat regions. Likewise, in Drosophila, the BAC physical map was
most useful in regions near the highly repetitive centromeres and
telomeres. WGA has been found to deliver excellent-quality reconstructions
of the unique regions of the genome. As the genome size, and more
importantly the repetitive content, increases, the WGA approach delivers
less of the repetitive sequence.

The cost and overall efficiency of clone-by-clone approaches make them
difficult to justify as a stand-alone strategy for future large-scale
genome-sequencing projects. Specific applications of BAC-based or other
clone mapping and sequencing strategies to resolve ambiguities in sequence
assembly that cannot be efficiently resolved with computational approaches
alone are clearly worth exploring. Hybrid approaches to whole-genome
sequencing will only work if there is sufficient coverage in both the
whole-genome shotgun phase and the BAC clone sequencing phase. Our
experience with human genome assembly suggests that this will require at
least 3× coverage of both whole-genome and BAC shotgun sequence data.
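
The 3× figure can be put in context with the idealized Poisson (Lander-Waterman) model of random shotgun sampling. The Python sketch below is an illustration under that textbook assumption and is not the assembly procedure described here: it ignores repeats, cloning bias, and the read overlap that assembly requires, and the function name `expected_fraction_covered` is hypothetical.

```python
import math


def expected_fraction_covered(coverage):
    """Expected fraction of bases sampled by at least one read.

    Assumes reads land uniformly at random, so the number of reads over a
    given base is Poisson distributed with mean equal to the coverage.
    """
    return 1.0 - math.exp(-coverage)


for c in (1, 3, 6):
    print(f"{c}x coverage -> {expected_fraction_covered(c):.4f} of bases sampled")
```

Under this simplified model, 3× coverage from one data source alone would leave roughly 5% of bases unsampled, while the two 3× sources pooled (6×) would leave well under 1%.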



8.2 The low gene number in humans

We have sequenced and assembled ~95% of the euchromatic sequence of H.
sapiens and used a new automated gene prediction method to produce a
preliminary catalog of the human genes. This has provided a major surprise:
We have found far fewer genes (26,000 to 38,000) than the earlier molecular
predictions (50,000 to over 140,000). Whatever the reasons for this current
disparity, only detailed annotation, comparative genomics (particularly
using the Mus musculus genome), and careful molecular dissection of complex
phenotypes will clarify this critical issue of the basic "parts list" of
our genome. Certainly, the analysis is still incomplete and considerable
refinement will occur in the years to come as the precise structure of each
transcription unit is evaluated. A good place to start is to determine why
the gene estimates derived from EST data are so discordant with our
predictions. It is likely that the following contribute to an inflated gene
number derived from ESTs: the variable lengths of 3'- and 5'-untranslated
leaders and trailers; the little-understood vagaries of RNA processing that
often leave intronic regions in an unspliced condition; the finding that
nearly 40% of human genes are alternatively spliced (153); and finally, the
unsolved technical problems in EST library construction where contamination
from heterogeneous nuclear RNA and genomic DNA is not uncommon. Of course,
it is possible that there are genes that remain unpredicted owing to the
absence of EST or protein data to support them, although our use of mouse
genome data for predicting genes should limit this number. As was true at
the beginning of genome sequencing, ultimately it will be necessary to
measure mRNA in specific cell types to demonstrate the presence of a gene.
J. B. S. Haldane speculated in 1937 that a population of organisms might
have to pay a price for the number of genes it can possibly carry. He
theorized that when the number of genes becomes too large, each zygote
carries so many new deleterious mutations that the population simply cannot
maintain itself. On the basis of this premise, and on the basis of
available mutation rates and x-ray-induced mutations at specific loci,
Muller, in 1967 (154), calculated that the mammalian genome would contain a
maximum of not much more than 30,000 genes (155). An estimate of 30,000
gene loci for humans was also arrived at by Crow and Kimura (156). Muller's
estimate for D. melanogaster was 10,000 genes, compared to 13,000 derived
by annotation of the fly genome (26, 27). These arguments for the
theoretical maximum gene number were based on simplified ideas of genetic
load: that all genes have a certain low rate of mutation to a deleterious
state. However, it is clear that many mouse, fly, worm, and yeast knockout
mutations lead to almost no discernible phenotypic perturbations.
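
The shape of the genetic load argument can be illustrated with one textbook formalization. This is offered only as an assumption-laden sketch and is not the calculation Haldane or Muller performed: if each of N genes mutates to a deleterious state at rate u per gene per generation, a diploid zygote carries on average U = 2Nu new deleterious mutations, and under multiplicative selection the equilibrium mean fitness is approximately e^(-U). The per-gene rate of 1e-5 used below is a hypothetical round number chosen for illustration, and the function name `mean_fitness` is likewise hypothetical.

```python
import math


def mean_fitness(n_genes, u_per_gene):
    """Approximate equilibrium mean fitness under multiplicative selection.

    U = 2 * n_genes * u_per_gene is the expected number of new deleterious
    mutations per diploid zygote; mean fitness is then about exp(-U).
    """
    new_mutations_per_zygote = 2 * n_genes * u_per_gene
    return math.exp(-new_mutations_per_zygote)


U_PER_GENE = 1e-5  # hypothetical per-gene deleterious mutation rate

for n_genes in (10_000, 30_000, 100_000):
    print(f"{n_genes:>7} genes -> mean fitness {mean_fitness(n_genes, U_PER_GENE):.3f}")
```

Under these assumptions the mean fitness falls steeply as the gene number grows, which is the intuition behind an upper bound on gene number. The model treats every gene as equally mutable and every mutation as deleterious, the simplification that the knockout observations above call into question.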
The modest number of human genes means that we must look elsewhere for the
mechanisms that generate the complexities inherent in human development and
the sophisticated signaling systems that maintain homeostasis. There are a
large number of ways in which the functions of individual genes and gene
products are regulated. The degree of "openness" of chromatin structure and
hence transcriptional activity is regulated by protein complexes that
involve histone and DNA enzymatic modifications. We enumerate many of the
proteins that are likely involved in nuclear regulation in Table 19. The
location, timing, and quantity of transcription are intimately linked to
nuclear signal transduction events as well as to the tissue-specific
expression of many of these proteins. Equally important are regulatory DNA
elements that include insulators, repeats, and endogenous viruses (157);
methylation of CpG islands in imprinting (158); and promoter-enhancer and
intronic regions that modulate transcription. The spliceosomal machinery
consists of multisubunit proteins (Table 19) as well as structural and
catalytic RNA elements (159) that regulate transcript structure through
alternative start and termination sites and splicing. Hence, there is a
need to study different classes of RNA molecules (160) such as small
nucleolar RNAs, antisense riboregulator RNA, RNA involved in X-dosage
compensation, and other structural RNAs to appreciate their precise role in
regulating gene expression. The phenomenon of RNA editing, in which coding
changes occur directly at the level of mRNA, is of clinical and biological
relevance (161). Finally, examples of translational control include
internal ribosomal entry sites that are found in proteins involved in cell
cycle regulation and apoptosis (162). At the protein level, minor
alterations in the nature of protein-protein interactions, protein
modifications, and localization can have dramatic effects on cellular
physiology (163). This dynamic system therefore has many ways to modulate
activity, which suggests that definition of complex systems by analysis of
single genes is unlikely to be entirely successful.
In situ studies have shown that the human genome is asymmetrically
populated with G+C content, CpG islands, and genes. However, the genes are
not distributed quite as unequally as had been predicted (Table 9) (59).
The most G+C-rich fraction of the genome, H3 isochores, constitute more of
the genome than previously thought (about 9%), and are the most gene-dense
fraction, but contain only 25% of the genes, rather than the predicted
~40%. The low G+C L isochores make up 65% of the genome, and 48% of the
genes. This inhomogeneity, the net result of millions of years of mammalian
gene duplication, has been described as the "desertification" of the
vertebrate genome (77). Why are there clustered regions of high and low
gene density, and are these accidents of history or driven by selection and
evolution? If these deserts are dispensable, it ought to be possible to
find mammalian genomes that are far smaller in size than the human genome.
Indeed, many species of bats have genome sizes that are much smaller than
that of humans; for example, [...], a species of Italian bat, has a genome
size that is only 50% that of humans (164). Similarly, Muntiacus, a species
of Asian barking deer, has a genome size that is ~70% that of humans.
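
The isochore percentages quoted above translate directly into relative gene densities. The Python sketch below does that arithmetic using only the figures stated in the text; the dictionary keys and the helper name `relative_gene_density` are illustrative assumptions.

```python
# (share of the genome, share of the genes) as quoted in the text.
isochores = {
    "H3 (most G+C-rich)": (0.09, 0.25),
    "L (low G+C)": (0.65, 0.48),
}


def relative_gene_density(genome_share, gene_share):
    """Gene density relative to the genome-wide average (average = 1.0)."""
    return gene_share / genome_share


for name, (genome_share, gene_share) in isochores.items():
    density = relative_gene_density(genome_share, gene_share)
    print(f"{name:20s} {density:.2f}x the genome-wide average")
```

On these figures the H3 fraction is a little under three times as gene-dense as the genome average and the L fraction about three-quarters as dense, a contrast of roughly fourfold between the two.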

8.3 Human DNA sequence variation 
and its distribution across the genome 

This is the first eukaryotic genome in which a nearly uniform ascertainment
of polymorphism has been completed. Although we have identified and mapped
more than 3 million SNPs, this by no means implies that the task of finding
and cataloging SNPs is complete. These represent only a fraction of the
SNPs present in the human population as a whole. Nevertheless, this first
glimpse at genome-wide variation has revealed strong inhomogeneities in the
distribution of SNPs across the genome. Polymorphism in DNA carries with it
a snapshot of the past operation of population genetic forces, including
mutation, migration, selection, and genetic drift. The availability of a
dense array of SNPs will allow questions related to each of these factors
to be addressed on a genome-wide basis. SNP studies can establish the range
of haplo-

nome would open up new strategies for human biological research and would
have a major impact on medicine, and through medicine and public health, on
society. Effects on biomedical research are already being felt. This
assembly of the human genome sequence is but a first, hesitant step on a
long and exciting journey toward understanding the role of the genome in
human biology. It has been possible only because of innovations in
instrumentation and software that have allowed automation of almost every
step of the process from DNA preparation to annotation. The next steps are
clear: We must define the complexity that ensues when this relatively
modest set of about 30,000 genes is expressed. The sequence provides the
framework upon which all the genetics, biochemistry, physiology, and
ultimately phenotype depend. It provides the boundaries for scientific
inquiry. The sequence is only the first level of understanding of the
genome. All genes and their control elements must be identified; their
functions, in concert as well as in isolation, defined; their sequence
variation worldwide described; and the relation between genome variation
and specific phenotypic characteristics determined. Now we know what we
have to explain.

Another paramount challenge awaits: public discussion of this information
and its potential for improvement of personal health. Many diverse sources
of data have shown that any two individuals are more than 99.9% identical
in sequence, which means that all the glorious differences among
individuals in our species that can be attributed to genes fall in a mere
0.1% of the sequence. There are two fallacies to be avoided: determinism,
the idea that all characteristics of the person are "hard-wired" by the
genome; and reductionism, the view that with complete knowledge of the
human genome sequence, it is only a matter of time before our understanding
of gene functions and interactions will provide a complete causal
description of human variability. The real challenge of human biology,
beyond the task of finding out how genes orchestrate the construction and
maintenance of the miraculous mechanism of our bodies, will lie ahead as we
seek to explain how our minds have come to organize thoughts sufficiently
well to investigate our own existence.
cipitated and dissolved in 1 ml TE buffer. To make

polished with consecutive BAL31 nuclease and T4 DNA polymerase treatments,
and size-selected by electrophoresis on 1% low-melting-point agarose. After
ligation to Bst XI adapters (Invitrogen, catalog no. N408-18), DNA was
purified by three rounds of gel electrophoresis to remove excess adapters,
and

inserted into Bst XI-linearized plasmid vector with 3'-TCTC overhangs.
Libraries with three different average sizes of inserts were constructed:
2, 10, and 50 kbp. The 2-kbp fragments were cloned in a high-copy pUC18
derivative. The 10- and 50-kbp fragments were cloned in a medium-copy
pBR322 derivative. The 2- and 10-kbp libraries yielded uniform-sized large
colonies on plating. However, the 50-kbp libraries produced many small
colonies and inserts were unstable. To remedy this, the 50-kbp libraries
were digested with Bgl II, which did not cleave the vector, but generally
cleaved several times within the 50-kbp insert. A 1264-bp Bam HI kanamycin
resistance cassette (purified from pUC4K; Amersham Pharmacia, catalog no.
27-4958-01) was added and ligation was carried out at 37°C in the continual
presence of Bgl II. As Bgl II-Bgl II ligations occurred, they were
continually cleaved, whereas Bam HI-Bgl II ligations were not cleaved. A
high yield of internally deleted circular library molecules was obtained in
which the residual insert ends were separated by the kanamycin cassette
DNA. The internally deleted libraries, when plated on agar containing
ampicillin (50 μg/ml), carbenicillin (50 μg/ml), and kanamycin (15 μg/ml),
produced relatively uniform large colonies. The resulting clones could be
prepared for sequencing using the same procedures as clones from the 10-kbp
libraries.
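
The selective ligation step above relies on Bgl II and Bam HI leaving the same GATC overhang while recognizing different hexamers. The Python sketch below illustrates that logic on the top strand only; the recognition sequences (Bgl II A^GATCT, Bam HI G^GATCC) are the standard ones for these enzymes, while the helper names `ligate` and `recut_by` are hypothetical and the code is a toy model, not a description of the laboratory protocol.

```python
BGL_II = "AGATCT"  # cut A^GATCT, leaving a 5'-GATC overhang
BAM_HI = "GGATCC"  # cut G^GATCC, leaving the same 5'-GATC overhang


def ligate(left_site, right_site):
    """Top-strand hexamer formed when the upstream end of one cut site is
    joined to the downstream end of another through the shared overhang."""
    upstream_base = left_site[:1]      # base left of the cut
    downstream_part = right_site[1:]   # GATC overhang plus the final base
    return upstream_base + downstream_part


def recut_by(junction):
    """Names of the enzymes whose recognition site the junction restores."""
    sites = {"Bgl II": BGL_II, "Bam HI": BAM_HI}
    return [name for name, site in sites.items() if junction == site]


junctions = {
    "Bgl II end + Bgl II end": ligate(BGL_II, BGL_II),
    "Bam HI end + Bgl II end": ligate(BAM_HI, BGL_II),
    "Bgl II end + Bam HI end": ligate(BGL_II, BAM_HI),
}
for name, sequence in junctions.items():
    print(f"{name}: {sequence}  recut by: {recut_by(sequence) or 'neither enzyme'}")
```

Only the Bgl II-Bgl II junction regenerates a Bgl II site, so in the continual presence of Bgl II those products are cut again while the hybrid Bam HI-Bgl II junctions accumulate.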

. 34. . Transfonmed . cells were plated on .agar diffusion 
; - plates prepared with a fresh top layer containing no 
. antibiotic poured on top of a. previously set bottom . 
■ layer . containing excess antibiotic, to achieve the 
con-ect final concentratioa.This method of plating 
permitted the cells to develop antibiotic resistance 
before being exposed to antibiotic without the po- 
tential clone bias that can be Introduced through . 
liquid outgrowth protocoU. • After colonies had 
., grown. QBot (Genetix. UK) automated.colony-plck- 
, ing robots were used to pick colonies meeting strin- • - 
gent size and shape criteria and to. Inoculate 384- 
:\- well.microtiter plates containing liquid growth me- . 
.r,'- dium.. Liquid, cultures were .Incubated overnight. - 
.. with shaking, and were, scored for growth before 
. passing to template preparation. Template DNA was 
^ extracted from liquid bacterial culture using a pro- 
- , cedure based upon the alkaline lysis minlprep rrieth- 
od (773) adapted for high throughput processing In 
384-well microtiter plates. Bacterial cells were 
lysed: cell debris was removed by centrifugatlon; 
and plasmid DNA was Vecovered by Isopropanol 
. .-v precipitation , and /esuspended in . io mM tris-HCl 
,: : buffer.- Reagent dispensing operations were accom- 
plished using TItertek MAP 8 liquid dispensing sys- 
tems. Plate-to-plate liquid transfers were performed 
using Tomtec Quadra 384 Model 320 pipetting ro- 
bots. All plates were tracked throughout processing 
by unique plate barcodes. Mated sequencing reads 
from opposite ends of each done Insert were ob- 
tained by preparing two 384-well cycle sequencing 
reaction plates from each plate of.plasmid template 
DNA using ABI-PRISM BigDye Temiinator chemistry 
(Applied Biosystems) and standard M13 forward 
and reverse primers. Sequencing reactions were pre- 
pared using the Tomtec Quadra 384-320 pipetting • 
robot Parent-child plate relationships and, by ex- 
tension, forward-reverse sequence mate pairs were 
established by automated plate barcode reading by 
tlie onboard barcode reader and were recorded by 
direct UMS communicatioa .Sequencing reaction . ' 
producU were purified by alcohol precipitation and . 
were dried, sealed, and stored at 4"C In the dark 
until needed for sequencing, at which time the
reaction products were resuspended in deionized
formamide and sealed immediately to prevent
degradation. All sequence data were generated using a
single sequencing platform, the ABI PRISM 3700 
DNA Analyzer. Sample sheets were created at load
time using a Java-based application that facilitates
barcode scanning of the sequencing plate barcode, 
retrieves sample information from the central LIMS,
and reserves unique trace identifiers. The
application permitted a single sample sheet file in the
linking directory and deleted previously created
sample sheet files immediately upon scanning of a
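
The parent-child plate tracking described in this note can be illustrated with a minimal sketch. It assumes, hypothetically, that each reaction plate record carries its own barcode, the barcode of the template plate it was prepared from, and a read direction, and that a given well position holds the same clone on both reaction plates. The class name, field names, and barcode strings are illustrative only and are not taken from the LIMS described here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReactionPlate:
    """Hypothetical record for one cycle sequencing reaction plate."""
    barcode: str          # barcode of the reaction (child) plate
    parent_barcode: str   # barcode of the template (parent) plate
    direction: str        # "forward" or "reverse" primer


def mate_pairs(reaction_plates, wells):
    """Pair forward and reverse reads that come from the same template well.

    Two reaction plates sharing a parent template plate hold reads from
    opposite ends of the same clone inserts, well for well.
    """
    children = {}
    for plate in reaction_plates:
        children.setdefault(plate.parent_barcode, {})[plate.direction] = plate.barcode

    pairs = []
    for parent in sorted(children):
        by_direction = children[parent]
        # Only templates with both child plates recorded yield mate pairs.
        if "forward" in by_direction and "reverse" in by_direction:
            for well in wells:
                pairs.append(((by_direction["forward"], well),
                              (by_direction["reverse"], well)))
    return pairs


plates = [
    ReactionPlate("RXN0001", "TMPL0001", "forward"),
    ReactionPlate("RXN0002", "TMPL0001", "reverse"),
    ReactionPlate("RXN0003", "TMPL0002", "forward"),  # reverse plate not yet recorded
]
example_pairs = mate_pairs(plates, ["A1", "A2"])
```

In this sketch the mate-pair relationship is derived entirely from plate-level records, so no per-sample bookkeeping is needed beyond the well position.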

share at least one significant BLAST hit in common.
This is an especially interesting property of the
metric because it allows the rapid recovery of
protein families from the proteome for which no
multiple alignment is possible, thus providing a
computational basis for the extension of protein homology
searches beyond those of current HMM- and
profile-based search methods. Once the whole-proteome
similarity matrix has been calculated, Lek first
partitions the proteome into single-linkage clusters
(27) on the basis of one or more shared BLAST hits
between two sequences. Next, these single-linkage
clusters are further partitioned into subclusters,
each member of which shares a user-specified
pairwise similarity with the other members of the
cluster, as described above. For the purposes of this
publication, we have focused on the analysis of
single-linkage clusters and what we have termed
which every member has a similarity metric of 1 to
complete clusters, because it is impossible to place
a unique multidomain protein into a complete
cluster. Thus, the single-linkage and complete clusters
plus singletons should comprise a lower and upper
bound of sizes of core protein sets, respectively, 
allowing us to compare the relative size and com-
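
The single-linkage step described in this note, in which two sequences are linked when they share one or more BLAST hits, can be sketched as a connected-components computation. The following is a minimal illustration and not the Lek implementation: it assumes, hypothetically, that the significant hits for each protein are already available as a set of hit identifiers, and it uses a union-find structure as one convenient way to merge linked sequences. The function name and the input layout are assumptions. The further partitioning into subclusters depends on the similarity metric, which is not reproduced here, so it is omitted.

```python
from collections import defaultdict


def single_linkage_clusters(hits):
    """Group sequences that are connected through shared BLAST hits.

    hits: dict mapping a sequence id to the set of ids of its significant
    BLAST hits (an assumed input layout, not the Lek data format).
    Returns a list of clusters, each a sorted list of sequence ids,
    ordered by decreasing size and then alphabetically.
    """
    parent = {seq: seq for seq in hits}

    def find(x):
        # Walk to the root, halving the path along the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Any two sequences that list the same hit are linked.
    first_with_hit = {}
    for seq, hit_ids in hits.items():
        for hit in hit_ids:
            if hit in first_with_hit:
                union(first_with_hit[hit], seq)
            else:
                first_with_hit[hit] = seq

    clusters = defaultdict(set)
    for seq in hits:
        clusters[find(seq)].add(seq)
    return sorted((sorted(c) for c in clusters.values()),
                  key=lambda c: (-len(c), c))


example_hits = {
    "p1": {"h1", "h2"},
    "p2": {"h2"},
    "p3": {"h3"},
    "p4": {"h3", "h4"},
    "p5": set(),          # no significant hits: remains a singleton
}
example_clusters = single_linkage_clusters(example_hits)
```

Because linkage is transitive, two proteins can fall in the same single-linkage cluster without sharing a hit directly, which is consistent with the note's treatment of these clusters as the loosest grouping.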